Probabilistic  Counterfactuals: 
Semantics,  Computation,  and  Applications 


Alexander  A.  Balke 
Judea  Pearl 


Final  Technical  Report 
Submitted  to 

U.S.  Air  Force  /  Office  of  Scientific  Research 
F-49620-93- 1-0421 
7/1/93  -  6/30/96 




This report is based on A. Balke’s dissertation submitted to UCLA in partial
satisfaction  of  the  requirements  for  the  degree  of  Doctor  of  Philosophy  in  Computer 
Science. 




REPORT DOCUMENTATION PAGE

Form Approved
OMB No. 0704-0188

4. TITLE AND SUBTITLE

Title: Dynamic Networks Techniques for [illegible] Planning and Control

Subtitle: Probabilistic Counterfactuals

6. AUTHOR(S)

Professor  Judea  Pearl 

7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES)

UCLA 

Computer  Science  Department 
4532  Boelter  Hall 
Los  Angeles,  CA  90095-1596 


9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES)

U.S.  Air  Force  /  Office  of  Scientific  Research 

110  Duncan  Avenue,  Suite  B115 

Bolling  Air  Force  Base 

Washington,  DC  20332-0001  ' 


11. SUPPLEMENTARY NOTES


G - F49620-93-1-0421


8. PERFORMING ORGANIZATION
REPORT NUMBER

A87-1670F-01

442510-22525


12a. DISTRIBUTION/AVAILABILITY STATEMENT

Approved for public release; distribution is unlimited.


12b. DISTRIBUTION CODE


13. ABSTRACT (Maximum 200 words)

We have reformulated Bayesian networks as carriers of causal information. The result
is a more natural understanding of what the networks stand for, what judgments are
required in constructing the network and, most importantly, how actions and plans are
to be handled within the framework of standard probability theory. Starting with a
functional description of physical mechanisms, we were able to derive the standard
probabilistic properties of Bayesian networks and to show:

* how the effects of unanticipated actions can be predicted from the network topology,

* how qualitative causal judgments can be integrated with statistical data,

* how actions interact with observations, and

* how counterfactual sentences can be formulated and evaluated.


14. SUBJECT TERMS

Keywords: Causation, counterfactuals, Bayesian networks

15. NUMBER OF PAGES

16. PRICE CODE

17. SECURITY CLASSIFICATION OF REPORT: unclassified

18. SECURITY CLASSIFICATION OF THIS PAGE: unclassified

19. SECURITY CLASSIFICATION OF ABSTRACT: unclassified

20. LIMITATION OF ABSTRACT

Standard Form 298 (Rev. 2-89)


Abstract 


Counterfactual conditionals of the form “If A were true, then C” are commonly
used to express generic, law-like relationships. This dissertation provides
formal semantics for interpreting such conditionals, as well as computational
methods for answering queries of the form “Find the probability of C if A were
true, given that A is in fact false.” Here, generic knowledge is represented as a
network of causal relationships among variables of interest, while specific
occurrences are represented as instantiations of those variables. The counterfactual
antecedent A is interpreted as a local, hypothetical change induced by forces
external to the system. Counterfactual probabilities are computed using standard
evidence propagation in two loosely coupled Bayesian networks (one corresponding
to the factual world, the other to the counterfactual), where the probabilities
are defined over the causal mechanisms governing the domain. When
such probabilities are not available, we develop methods for computing either
bounds on the counterfactual probabilities or qualitative beliefs, i.e.,
order-of-magnitude abstractions of standard probabilities.

We then demonstrate the usefulness of our formulation in application areas
where counterfactual reasoning is essential but considered difficult, if not
impossible, to compute. First, we examine experimental studies in which subjects
do not comply perfectly with treatment assignment, thus violating the tenets
of randomized experimentation. We show that it is possible in such studies to
derive informative bounds on treatment efficacy, tighter than any yet reported
in the statistical or the epidemiological literature. Next, we address the problem
of determining legal responsibility (e.g., whether the defendant is liable for the
plaintiff’s injuries). Although counterfactual assertions in this domain cannot be
evaluated using conventional statistical analysis, under our formalism they can
be assigned meaningful probability intervals. In the areas of econometrics and
the social sciences, the formalism allows coherent evaluation of policies involving
the control of variables that, prior to enacting a given policy, were influenced
by other variables in the system. Finally, in the area of artificial intelligence,
the formulation provides a computational model for interpreting counterfactual
utterances, answering counterfactual queries, and evaluating actions and plans.


CHAPTER 1


Counterfactuals


1.1  Introduction 

A  counterfactual  conditional  has  the  form 

If  A  were  true,  then  C  would  have  been  true 

where  A,  the  counterfactual  antecedent,  specifies  an  event  that  is  contrary  to 
one’s  real-world  observations,  and  C,  the  counterfactual  consequent,  specifies  a 
result  that  is  expected  to  hold  in  the  alternative  world  where  the  antecedent 
is  true.  A  typical  instance  is  “If  Oswald  were  not  to  have  shot  Kennedy,  then 
Kennedy  would  still  be  alive”  which  presumes  the  factual  knowledge  of  Oswald’s 
shooting  Kennedy,  contrary  to  the  antecedent  of  the  sentence. 

The majority of the philosophers who have examined the semantics of
counterfactual sentences [Goo83, HSP81, Nut80, cou93] have resorted to some form
of logic based on worlds that are “closest” to the real world yet consistent with
the counterfactual’s antecedent. Ginsberg [Gin86], following a similar strategy,
suggested that the logic of counterfactuals could be applied to problems in
planning and diagnosis in Artificial Intelligence. The few other papers in AI that have
focussed on counterfactual sentences (e.g., [Jac89, PAA91, Bou92, Gra91]) have
mostly adhered to logics based on the “closest world” approach.

In the real world, we seldom have adequate information for verifying the truth
of an indicative sentence, much less the truth of a counterfactual sentence.
Except for the small set of relationships between variables which can be modeled
by physical laws, most of the relationships in one’s knowledge base are
nondeterministic. Therefore, it is more practical to ask not for the truth or falsity of
a counterfactual, but for one’s degree of belief in the counterfactual consequent
given the antecedent. To account for such uncertainties, [Lew76] has generalized
the notion of “closest world” using the device of “imaging”; namely, the closest
worlds are assigned probability scores, and these scores are combined to compute
the probability of the consequent.

Missing from the “closest world” approach is a precise specification of the
closeness measure itself, which is critical to the analysis of counterfactuals. More
specifically, it does not tell us how to encode distances in a way that would (1)
conform to our perception of causal influences and (2) lend itself to economical
machine representation. This dissertation will provide a concrete explication of
the closest world approach, one that satisfies the two requirements above.

The targets of this investigation are counterfactual queries of the form:

If A were true, then what is the probability that C would have been
true, given that we know B?

The  proposition  B  stands  for  the  actual  observations  made  in  the  real  world 
(e.g.,  that  Oswald  did  shoot  Kennedy  and  that  Kennedy  is  dead)  which  are  made 
explicit  to  facilitate  the  analysis. 

Counterfactuals are intertwined with notions of causality: We do not
typically express counterfactual conditionals without assuming a causal relationship
between the counterfactual antecedent and the counterfactual consequent. For
example, we can safely state “If the sprinkler were on, the grass would be wet”,
but the contrapositive form of the same sentence in counterfactual form, “If the
grass were dry, then the sprinkler would not be on”, strikes us as strange,
because we do not think the state of the grass has causal influence on the state
of the sprinkler. Likewise, we do not state “All blocks on this table are green,
hence, had this white block been on the table, it would have been green”. In fact,
we could say that people’s use of counterfactual conditionals is aimed precisely
at conveying generic causal information, uncontaminated by specific, transitory
observations about the real world. Observed facts often do reflect strange
combinations of rare eventualities (e.g., all blocks being green) that have nothing to do
with general traits of influence and behavior. The counterfactual sentence,
however, emphasizes the law-like, necessary component of the relation considered. It
is for this reason, we speculate, that we find such frequent use of counterfactuals
in ordinary discourse.

The importance of equipping machines with the capability to answer
counterfactual queries lies precisely in this causal reading. By making a counterfactual
query, the user intends to extract the generic, necessary connection between the
antecedent and consequent, regardless of the contingent factual information
available at that moment.

Although some philosophers consider the analysis of counterfactuals where
no causal information is available (e.g., the “All blocks on the table are green”
example), these will not be treated in this dissertation. The interpretation of
counterfactuals presented here relies on a strict separation of generic background
causal knowledge and transient observations of the world. The transient
observations (e.g., “All blocks on the table are green”) may not be used as inference
rules; only the generic causal knowledge may be used for inferring beliefs from
observations. [Goo83] stresses the importance of distinguishing causal information
from observed facts:

Though the supposed connecting principle is indeed general, true, and
perhaps even fully confirmed by observation of all cases, it is incapable
of sustaining a counterfactual because it remains a description of
accidental fact, not a law. The truth of a counterfactual conditional thus
seems to depend on whether the general sentence required for the
inference is a law or not. If so, our problem is to distinguish accurately
between causal laws and casual facts.

Because of the tight connection between counterfactuals and causal
influences, any algorithm for computing counterfactual queries must rely heavily on
causal knowledge of the domain. This leads naturally to the use of probabilistic
causal networks, since these networks combine causal and probabilistic knowledge
and permit reasoning from causes to effects as well as, conversely, from effects
to causes. This representation also reflects the separation of causal knowledge
from transient observations: causal knowledge is represented by the structure of
the network and its parameterization, while observations are represented by the
instantiation of nodes within the network.

To emphasize the causal character of counterfactuals, we will adopt the
interpretation in [Pea93c], according to which a counterfactual sentence “If A were
true, then B would have been true” states that B would prevail if A were forced
to be true by some unspecified intervention that is exogenous to the other
relationships considered in the analysis. This intervention-based interpretation does
not permit inferences from the counterfactual antecedent towards events that lie
in its past. For example, the intervention-based interpretation would ratify the
counterfactual

If  Kennedy  were  alive  today,  then  the  country  would  have  been  in  a 
better  shape 

but not the counterfactual

If Kennedy were alive today, then Oswald would have been alive as
well.

The  former  is  admitted  because  the  causal  influence  of  Kennedy  on  the  country 
is  presumed  to  remain  valid  even  if  Kennedy  became  alive  by  an  act  of  God.  The 
second  sentence  is  disallowed  because  Kennedy  being  alive  is  not  perceived  as 
having  causal  influence  on  Oswald  being  alive.  The  information  intended  in  the 
second  sentence  is  better  expressed  in  an  indicative  mood: 

If Kennedy was alive today then he could not have been killed in
Dallas; hence, Jack Ruby would not have had a reason to kill Oswald
and Oswald would have been alive today.

This interpretation of counterfactual antecedents, which is similar to Lewis’
[Lew79] Miraculous Analysis, contrasts with interpretations that require that the
counterfactual antecedent be consistent with the world in which the analysis
occurs. The set of closest worlds delineated by the intervention-based interpretation
contains all those which coincide with the factual world except on possible
consequences of the intervention. The probabilities assigned to these worlds will be
determined by the relative likelihood of those consequences as encoded by the
causal network.

Finally, the counterfactuals that may be analyzed within the context of this
dissertation are limited in terms of the form of the antecedent and the types of
causal relationships. Counterfactual antecedents will be limited to conjunctive
clauses. For example, we will not consider the veracity of the following
counterfactual conditionals:

If Bizet and Verdi had been compatriots, Bizet would have been Italian.

If Bizet and Verdi had been compatriots, Verdi would have been
French.

because Bizet and Verdi being compatriots would be defined as “Bizet and Verdi
are Italians, or Bizet and Verdi are French, or Bizet and Verdi are ...”, which is a
disjunction of conjunctive clauses.

In addition, we will not consider counterfactual conditionals that are
counterlegals; a counterlegal is defined as a counterfactual conditional where the
antecedent is impossible (e.g., violates some strict law), for example, “if this circle
were also a square, ...”. Our analysis always assumes that the counterfactual
antecedent is conceptually compatible (although its normal coincidence in the
world may be infinitesimally rare) with the history prior to the antecedent. In
other words, it must always be possible that exceptions exist to every rule in the
model impinging on the counterfactual antecedent variables.


1.2  Firing  Squad  Example 


To  illustrate  the  intervention-based  interpretation  of  counterfactuals,  consider  a 
firing  squad  with  several  riflemen  (one  called  Bob)  and  a  Captain  who  gives  a 
signal  to  either  shoot  or  release  a  prisoner  charged  with  treason.  The  behavior 
of these agents is as follows:


•  The  Captain  waits  for  the  court  decision. 

•  Bob  typically  fires  his  rifle  if  and  only  if  the  Captain  gives  the  signal  to 
shoot. 

•  The  Traitor  typically  dies  if  and  only  if  the  Captain  gives  the  signal  to  shoot 
or  Bob  fires  his  rifle. 


Note  that  if  the  Captain  gives  the  signal  to  shoot  and  Bob  does  not  fire,  the 
traitor  will  typically  die  as  a  result  of  the  other  riflemen  shooting,  but  these 
intermediate  causes  will  not  be  made  explicit  in  this  story  in  order  to  keep  the 
model  simple. 


The  generic  causal  structure  that  reflects  this  description  may  be  represented 
by  the  structure  in  Figure  1.1.  The  three  variables  C,  B,  and  T  have  the  following 
domains: 


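$$
\begin{aligned}
c &\in \{\, c_0 : \text{Captain gives the signal to release the traitor},\;\; c_1 : \text{Captain gives the signal to shoot the traitor} \,\} \\
b &\in \{\, b_0 : \text{Bob does not fire his rifle},\;\; b_1 : \text{Bob fires his rifle} \,\} \\
t &\in \{\, t_0 : \text{Traitor dies},\;\; t_1 : \text{Traitor lives} \,\}
\end{aligned}
$$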


Figure 1.1: Causal structure reflecting the influence that the Captain’s signal has
on Bob’s firing and the Traitor’s health, and the direct influence that Bob’s firing
has on the Traitor’s health.

Now consider the following discussion between two prison guards (Scott and
Dave) who, looking from a window at the jail, could only see that Bob fired his
rifle ($b = b_1$):

Dave: The Captain must have given the signal to shoot, or Bob
would not have fired his rifle.

Scott: That Traitor’s body must be riddled with bullets!

Dave: Yep. If Bob were not to have fired, the Traitor would
still have died.

Scott: Ha! If Bob were not to have fired, the Captain must
not have given the signal to fire, and none of the other
riflemen would have fired. Therefore, the Traitor would
still be alive.

Dave: No. If Bob were not to have fired despite the Captain’s
signal, the other riflemen would still have fired, and the
Traitor would be dead.


In the fourth sentence, Scott tries to explain away Dave’s conclusion by
claiming that Bob’s not firing would be evidence that the Captain gave the signal to
release the Traitor, which would imply that none of the riflemen fired. Scott,
however, analyzed Dave’s counterfactual conditional in the indicative mood by
imagining that he had observed Bob not firing his rifle; this allows him to use
the observation for abductive reasoning. But Dave’s subjunctive counterfactual
conditional should be interpreted as leaving everything in the past as it was
(including conclusions obtained from abductive reasoning from real observations)
while forcing variables to their counterfactual values. This is the gist of his last
statement.

This  example  demonstrates  the  plausibility  of  interpreting  the  counterfactual 
statement  in  terms  of  an  external  intervention  causing  Bob  to  not  fire,  regardless 
of  all  other  prior  circumstances.  The  only  variables  that  we  would  expect  to 
be  impacted  by  the  counterfactual  assumption  would  be  the  descendants  of  the 
counterfactual  variable;  in  other  words,  the  counterfactual  value  of  Bob’s  firing 
does  not  change  the  belief  in  the  Captain’s  signal  from  the  belief  prompted  by 
the  real-world  observation. 
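The contrast between Dave's and Scott's readings can be sketched in a few lines of Python. This is only an illustration under a simplifying assumption: the mechanisms are treated as strictly deterministic, whereas the text says the agents "typically" behave this way. The function names are hypothetical and not part of the original formalism.

```python
# Deterministic simplification (an assumption): the story's "typically"
# is ignored, so no exceptions to the rules are modelled.

def bob_fires(signal: bool) -> bool:
    """Bob fires iff the Captain gives the signal to shoot."""
    return signal

def traitor_dies(signal: bool, fires: bool) -> bool:
    """The Traitor dies iff the Captain signals to shoot or Bob fires."""
    return signal or fires

def abduce_signal(observed_fires: bool) -> list:
    """Captain's signals consistent with an observation of Bob."""
    return [s for s in (False, True) if bob_fires(s) == observed_fires]

def counterfactual_dies(observed_fires: bool, forced_fires: bool) -> list:
    """Dave's reading: keep the past inferred from the real observation,
    then force Bob's action by an external intervention."""
    return [traitor_dies(s, forced_fires) for s in abduce_signal(observed_fires)]

def indicative_dies(hypothetical_fires: bool) -> list:
    """Scott's reading: treat 'Bob did not fire' as if it were observed,
    so it is used abductively to revise the belief in the signal."""
    return [traitor_dies(s, hypothetical_fires)
            for s in abduce_signal(hypothetical_fires)]

print(counterfactual_dies(observed_fires=True, forced_fires=False))  # [True]
print(indicative_dies(hypothetical_fires=False))                     # [False]
```

In the first call the Captain's signal stays at the value abduced from the real observation, so only the descendant of Bob's firing (the Traitor's fate) is re-evaluated; in the second call the hypothetical is allowed to change the belief in the signal.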

The claim that the intervention-based interpretation of counterfactual
antecedents should be adopted for the analysis of counterfactual conditionals in
general is a controversial position. This interpretation does not cover all
linguistic usages of counterfactuals; however, it does provide clear semantics and
a precise computational formalism for analyzing counterfactuals given a causal
description of the world. The results of this analysis provide useful information
about the effects a localized change to a single variable would have on the world.
In contrast, it is not clear how useful other nonintervention-based interpretations
of counterfactuals are, because they imply nothing about control of one’s
environment. For example, some counterfactual interpretations will conclude that if Bob
were not to have fired, then the Traitor would still be alive; however, the Traitor’s
wife is not going to bribe Bob not to fire, because she knows that such an
intervention will not prevent her husband from being executed. It is not the goal
of this dissertation to provide a model of all linguistic usages of counterfactuals,
but to provide an interpretation that lends itself to meaningful application.


1.3  Previous  work 

Counterfactual conditionals have been extensively studied by the philosophy
community over the last twenty-five years. Among the more compelling research has been
the work of Stalnaker and Lewis, from which possible-world semantics have been
developed. These semantics formally describe how a closeness (or similarity)
measure between worlds can be used to evaluate one’s belief in a counterfactual
conditional. Most research has focussed on logical inferences, but some has
concerned itself with the probabilistic evaluation of counterfactuals. The 1986 paper
by Matt Ginsberg injected this important topic into the Artificial Intelligence
community, where, for the most part, the concentration has been on logical rather
than probabilistic formalisms. What has been most lacking in this work is a
precise specification of similarity between worlds, and this is paramount for practical
application of the possible-world semantics.

One popular proposal has been that counterfactual conditionals may be
analyzed by applying belief revision techniques, where possibly contradictory
information is added to the knowledge base and information is retracted in order to
bring about a consistent set of knowledge [Dal88, GM94, Gin86]. However, the
intervention-based interpretation of counterfactual antecedents proposed in this
dissertation is clearly not consistent with belief revision, because one may not
reason abductively from the new information added to the knowledge base.

The  remainder  of  this  section  will  review  some  important  contributions  in  the 
study  of  counterfactual  conditionals. 

1.3.1 Lewis’ closest-world semantics

Lewis’ closest-world semantics [Lew76] provides an intuitive interpretation of
the analysis of counterfactual conditionals. Just find the world most similar
to our observed world, such that the counterfactual antecedent holds true; if
the counterfactual consequent holds true in that most similar world, then the
counterfactual conditional is said to hold true.

In more detail, all worlds are first ordered relative to the observed world,
which results in progressively distant “spheres of similarity” surrounding the
observed world. This is graphically represented in Figure 1.2. Of interest is the
closest sphere in which there exist worlds where the counterfactual antecedent
A holds true. Within this set, the relative preponderance of worlds where the
counterfactual consequent C holds true is evaluated, leading to a belief in the
counterfactual consequent given the counterfactual antecedent.

In [Lew76], Lewis proposed a method for evaluating the probability of
Stalnaker’s conditionals $(A > C)$ by imaging the set of possible worlds with respect
to the antecedent $A$. Assume $P(w)$ is the distribution over all worlds conditioned
on our partial observation of the world. For each world $w$, there is an imagined
world $w_A$ that is most similar to $w$ among those worlds where the counterfactual
antecedent $A$ holds true. The probability of the worlds imaged on $A$ ($P_A$) is then
evaluated as

$$
P_A(w') = \sum_{w \,:\, w_A = w'} P(w)
$$

Figure 1.2: Graphical representation of Lewis’ closest-world semantics. Each
circular region corresponds to a set of worlds where each world is equally similar
to w. These regions are called spheres of similarity. The hashed region represents
the set of closest worlds where the counterfactual antecedent A holds true and the
counterfactual consequent holds true.


Lewis refers to this new distribution over all worlds as the “image of P on A.”
The probability of Stalnaker’s conditionals $P(A > C)$ is then evaluated as

$$
P(A > C) = P_A(C) = \sum_{w' \,:\, w' \models C} P_A(w')
$$


Missing from this discussion of Stalnaker conditionals, though, is a precise
formulation of closest worlds (which is crucial to imaging worlds and their
associated probability distribution). In Chapter 2 we present a formalism for evaluating
counterfactual conditionals that is consistent with Lewis’ formalism for
evaluating Stalnaker conditionals via imaging. In order to make the formalism concrete,
though, the notion of closest worlds must be formalized, and this will be
accomplished by turning to the causal structure of the world and interpreting the
counterfactual antecedent as an external intervention that forces the antecedent
to be true.
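As a rough illustration of imaging, the sketch below moves each world's probability mass to its closest antecedent-world and then sums the mass of the worlds where the consequent holds. The closest-world mapping is exactly the ingredient the discussion above says is left unspecified, so here it is simply supplied by hand; the toy worlds, the probabilities, and the names `image` and `prob_conditional` are assumptions made for this example.

```python
from collections import defaultdict

def image(P, closest_A):
    """Image of P on A: every world w sends all of its mass to closest_A[w]."""
    P_A = defaultdict(float)
    for w, mass in P.items():
        P_A[closest_A[w]] += mass
    return dict(P_A)

def prob_conditional(P, closest_A, C):
    """P(A > C): total imaged mass of the worlds in which C holds."""
    return sum(mass for w, mass in image(P, closest_A).items() if C(w))

# Toy worlds are pairs (a, c) of truth values for A and C (hypothetical data).
P = {(0, 0): 0.5, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.2}

# Hand-picked closest A-worlds (an assumption): A-worlds map to themselves,
# non-A-worlds map to the A-world that agrees with them on C.
closest_A = {(0, 0): (1, 0), (0, 1): (1, 1), (1, 0): (1, 0), (1, 1): (1, 1)}

p = prob_conditional(P, closest_A, C=lambda w: w[1] == 1)
print(round(p, 6))
```

Because imaging only relocates mass, the imaged distribution still sums to one and assigns zero probability to every world where A is false.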

1.3.2  Ginsberg 

[Gin86] introduced the study of counterfactuals to the Artificial Intelligence
community as an important facet of commonsense reasoning, and discussed several
application areas that could benefit from the analysis of counterfactual
conditionals.

Ginsberg presented a syntactic interpretation of Stalnaker’s closest-world
semantics. In his formulation, a world is specified by a set of logical statements $S$,
and a closest world to $S$ where $a$ is counterfactually true is given by a maximal
subset $S'$ of $S$ such that $S'$ does not imply $\neg a$. A counterfactual conditional $a > c$
is then accepted if $c$ is true in all maximal subsets $S'$. In order to incorporate
domain-dependent information into this strictly semantic interpretation, Ginsberg
suggests the use of a “badworld” predicate to explicitly eliminate worlds from
consideration. In addition, a partial order may be specified over all subsets of $S$
to extend the set inclusion measure of closeness. In generating possible worlds,
Ginsberg suggests (in the context of combinatorics) that rules of implication
should not be reversible; however, it is not clear whether this is based on the
belief that an implication represents a causal relationship which is not transient.

In general, though, a syntactic interpretation of closeness of worlds based on
set inclusion does not reflect our understanding of causal relationships in the
world. Suppose that somebody lined up 26 dominoes on end, and then tipped
the first domino towards the second domino, creating a chain reaction that finally
toppled all the dominoes. Let the standing state of these dominoes be represented
by the variables $A, B, C, \ldots, Z$, where $a_0, b_0, \ldots, z_0$ indicate that dominoes fell,
while $a_1, b_1, \ldots, z_1$ indicate that dominoes stood. In a syntactic interpretation
of counterfactuals such as that proposed by Ginsberg, one might model one’s
knowledge of the world by the propositions:

$$
\begin{array}{lll}
a_0 & a_0 \rightarrow b_0 & a_1 \rightarrow b_1 \\
b_0 & b_0 \rightarrow c_0 & b_1 \rightarrow c_1 \\
\vdots & \vdots & \vdots \\
y_0 & y_0 \rightarrow z_0 & y_1 \rightarrow z_1 \\
z_0 & &
\end{array}
$$


Consider the counterfactual query, “If domino B were not to have fallen, would
domino Z still have fallen?” Intuitively, we reason that if B had not fallen, then
there would have been no impetus to continue the chain reaction from domino
C to Z. Therefore, Z would not have fallen. Under our intervention-based
interpretation we would still state that A would have fallen, because we are
considering the world where B was forced to stand by some intervention, e.g.,
domino B was nailed down. This world is given by $\{a_0, b_1, c_1, d_1, \ldots, z_1\}$.

The syntactic approach, though, does not make use of this causal information
necessary for reaching the intuitive conclusion. This approach pursues the world
that retracts the fewest propositions in the above set. The two closest worlds (we
only show the fallen state of each domino) are given by $\{a_0, b_1, c_0, d_0, e_0, \ldots, z_0\}$
($b_0$, $a_0 \rightarrow b_0$, and $b_1 \rightarrow c_1$ are retracted) and $\{a_1, b_1, c_0, d_0, e_0, \ldots, z_0\}$ ($a_0$, $b_0$, and
$b_1 \rightarrow c_1$ are retracted). Clearly there is a disconnect between these results and
intuition coming from our causal knowledge, because the syntactic approach does
not distinguish transient observations from generic causal relationships.

If we do follow the suggestion that statements of implication should not be
retracted, then there is only one closest world, $\{a_1, b_1, c_1, d_1, e_1, \ldots, z_1\}$, which leads
us to the intuitive belief that the last domino would not have fallen. However,
there is a distinction between the results produced by this approach and that
produced by our intervention-based interpretation: whether the counterfactual
antecedent abductively implies some change to its causal effects (assuming that
implication is linked to causality in the syntactic interpretation).
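The three outcomes discussed for the domino chain can be checked by brute force on a shortened chain. The sketch below is an illustration only: it assumes five dominoes (A to E) instead of 26 so that all worlds can be enumerated, it measures closeness by the number of retracted propositions (following "retracts the fewest propositions" above), and all function names are hypothetical.

```python
from itertools import product

N = 5                   # dominoes A..E; the text uses 26 (A..Z)
FELL, STOOD = 0, 1      # subscripts 0 and 1 in the text
B = 1                   # position of domino B in the chain

def violated(world):
    """Propositions of the knowledge base that are false in `world`."""
    out = set()
    for i, state in enumerate(world):
        if state != FELL:
            out.add(("fact", i))              # the fact x_0 must be retracted
    for i in range(N - 1):
        if world[i] != world[i + 1]:
            out.add(("rule", i, world[i]))    # the rule x_s -> y_s must be retracted
    return out

def closest_syntactic(allow_rule_retraction=True):
    """Worlds where B stood that retract the fewest propositions."""
    worlds = [w for w in product((FELL, STOOD), repeat=N) if w[B] == STOOD]
    if not allow_rule_retraction:
        worlds = [w for w in worlds
                  if not any(v[0] == "rule" for v in violated(w))]
    fewest = min(len(violated(w)) for w in worlds)
    return [w for w in worlds if len(violated(w)) == fewest]

def intervention_world():
    """Force B to stand; only B's descendants are re-evaluated."""
    world = [FELL] * N              # factual world: every domino fell
    world[B] = STOOD                # external intervention, e.g. B nailed down
    for i in range(B + 1, N):
        world[i] = world[i - 1]     # each later domino follows its predecessor
    return tuple(world)

print(closest_syntactic())                             # [(0, 1, 0, 0, 0), (1, 1, 0, 0, 0)]
print(closest_syntactic(allow_rule_retraction=False))  # [(1, 1, 1, 1, 1)]
print(intervention_world())                            # (0, 1, 1, 1, 1)
```

Only the last call leaves domino A fallen while letting every domino after B stand, which is the world singled out by the intervention-based reading.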

Ginsberg claims that counterfactuals should not be tied too closely to the
notion of causality. Referring to the counterfactual conditional “If John had Koplik
spots, he’d have measles,” he states “... it is difficult to imagine how
counterfactual implication can capture a causal relation that remains asymmetric even
in this case.” Under our intervention-based interpretation, this counterfactual
would not hold true; it is possible that another mechanism may be found for
generating Koplik spots, and Koplik spots do not cause measles on their own. Of
course, if we observe Koplik spots, we will infer that the subject has measles, but
this is not the nature of a causal counterfactual conditional.

1.3.3  Simon  and  Rescher 

Simon and Rescher discussed the analysis of causal counterfactual conditionals
in [SR66]. This work is important for its formulation of counterfactuals within
a causal system, and for the distinction it makes between generic causal knowledge
and transient observations.

They propose that when the counterfactual antecedent is added to the
knowledge base, inconsistencies in one’s knowledge must be retracted without
violating any causal relationships (“We might of course give up the law (T), but
this course is obviously undesirable.”). In order to determine which knowledge is
retracted, each variable is assigned to a modal category according to its distance
down the causal chain from the exogenous variables. The higher the modal
category, the more susceptible the variable’s value is to retraction.

While  Simon  and  Rescher  choose  to  uphold  all  laws  in  one’s  model  of  the 
world,  our  intervention-based  interpretation  severs  the  causal  link  between  the 
antecedent  variables  and  their  modelled  set  of  causal  influences.  Consider  Simon 
and Rescher’s wheat-growing example, where fertilizer (F) and rain (R) influence
the  wheat  crop  (W),  while  the  wheat  crop  and  population  (N)  influence  the  wheat 
price  (P).  Figure  1.3  shows  this  causal  structure. 

Figure 1.3: Simon and Rescher’s structure representing the causal relationships
between fertilizer (F), rain (R), wheat crop (W), population (N), and wheat price
(P).

Given the above causal structure, Simon and Rescher’s counterfactual
conditional, “If the wheat crop had been smaller last year, the price would have been
higher” is consistent with our intervention-based interpretation; however, they
find the following statement “perfectly idiomatic”: “If the wheat crop had been
smaller last year, there would have been either less rain or less fertilizer applied.”
This contrasts with our interpretation, which leaves our belief in the rain and
fertilizer amounts unchanged. Simon and Rescher’s interpretation fits an analysis
where the antecedent is considered to be a passive observation, e.g., in a similar
world where we would have observed a smaller wheat crop, either there was less
rain or less fertilizer was applied. However, this analysis does not necessarily
tell us the causal influence that a change in wheat crop would have had on the
world, if there was another path from rain or fertilizer to the wheat price, because
the value for variables preceding the counterfactual antecedent may still affect
variables that are descendants of the antecedent variable.


1.4  Applications 

In this section the importance of causal counterfactual reasoning will be
emphasized by describing some of the tasks that benefit from such analysis. The
common occurrence of counterfactual statements in everyday human discourse
is a clear tipoff that counterfactuals are an integral part of human
communication. Aside from the efficiency gained in communication, formal evaluation of
counterfactuals is important to system design, fault diagnosis, liability litigation,
policy analysis, etc. Some of these are mentioned in [Gin86].




1.4.1  Communication 


Counterfactuals  are  a  prevalent  aspect  of  daily  communication  between  humans, 
which is interesting, because very often the conditional statement is made after
an irreversible event has occurred. For example, suppose that little Johnny
pulled  on  Sarah’s  pigtail  followed  by  Sarah  dumping  her  milk  shake  on  Johnny’s 
head.  Johnny,  totally  surprised,  turns  to  his  mother  and  cries  out  innocently  and 
indignantly  that  Sarah  has  done  something  terrible.  Johnny’s  mother,  having 
observed  the  whole  scene,  calmly  explains  to  Johnny,  “If  you  had  not  pulled  on 
Sarah’s  pigtail,  then  she  would  not  have  dumped  her  milk  shake  on  you.” 

This  counterfactual  conditional  is  useless  to  Johnny  at  this  point  in  time;  it 
will  not  bring  about  a  plan  to  get  clean,  and  it  will  not  allow  Johnny  to  exact 
retribution.  So  what  is  his  mother’s  point  in  making  this  statement?  Precisely  for 
conveying  information  to  Johnny  about  the  causal  relationship  between  pulling 
Sarah’s  hair  and  Sarah’s  subsequent  actions.  It  is  the  mother’s  belief  that  this 
information  will  allow  Johnny  to  formulate  a  belief  system  that  will  hopefully 
discourage  him  from  pulling  Sarah’s  hair  (at  least  when  he  does  not  want  to 
suffer  the  consequences). 

This information tells Johnny that, when everything else is held fixed, a
local change to Johnny’s hair pulling would evoke a change to the outcome. This
is more informative to Johnny than the statements, “If you pull Sarah’s hair, then
she punishes you” and “If you do not pull Sarah’s hair, then she will not punish
you.” These statements might not convey the same information to Johnny; he may
interpret them to mean that the situation where he would pull Sarah’s hair is the same
situation where Sarah is going to pour her milk shake on him, in which case he
might as well go ahead and yank her hair. Thus, we see that the counterfactual
conditional conveys the isolated causal effect of Johnny’s hair-pulling on Sarah’s
response, informing Johnny that his decision whether to pull Sarah’s hair will
have influence on Sarah’s reaction.

1.4.2  Liability  litigation 

Frequently, the analysis of counterfactual conditionals is required in the
determination of liability in legal cases. A plaintiff might claim that a defendant’s action
or product has inflicted damages on their person or property, and the court must
analyze the following types of questions. “If the plaintiff had not been exposed to
the product, would the plaintiff still have developed his current illness?” Or, “if
the defendant had not conspired to fix prices with the other manufacturers of
widgets, would the plaintiff not have lost his business?” To answer these questions,
the court ponders how a local change to the circumstances of the plaintiff (e.g.,
preventing the defendant’s action, or removing the product from the plaintiff’s
environment) would have affected his welfare differently from what actually occurred.
If the court decides that the local change would have prevented the plaintiff from
suffering financial or personal injury, then the court would find the defendant
liable for those damages.

In Chapter 7 we will discuss cases where the analysis of counterfactuals
involves statistical models, and we will present a hypothetical case using the
partial-compliance model of Chapter 5 to demonstrate that a court must apply
counterfactual probabilities in order to guarantee proper determination of liability in
product-safety litigation.

1.4.3  Policy  analysis 

In the clinical study of new drug treatments, researchers wish to determine
whether or not a particular drug will improve the overall rate of recovery of
subjects within the patient population. Subjects from the population are
randomized into one of two treatment groups and their treatment response is measured
at the end of the study. However, the study is seldom perfect: patients remove
themselves or are removed from the study; patients do not comply with their
treatment assignment; and exogenous influences confound the results. Given the
data from the study, the researchers wish to answer the following counterfactual
query: “If the patient population were uniformly treated with the drug under
study, would the overall recovery rate of subjects in the population have been
higher than if the population were uniformly given a placebo?”

This  application  of  counterfactual  probabilities  will  be  explored  in  depth  in 
Chapters 5 and 6. In addition, an example demonstrating economic policy analysis
using  linear  structural  equation  models  will  be  presented  in  Chapter  8. 

1.5  Contributions 

The principal contributions in this dissertation consist of:

• Specification of knowledge representation necessary for adequately
analyzing counterfactual conditionals.

• Proposal of unambiguous semantics for interpreting the meaning of
counterfactual antecedents in terms of intervention by an external force/action.

• Technique for evaluating counterfactual probabilities when a functional
model of a domain is provided.

• Method for evaluating bounds on counterfactual probabilities when a
probabilistic distribution is only available for observable variables, i.e., a
functional model is not known. A program is available for deriving closed-form
bounds when the counterfactual probability may be expressed as a linear
combination of terms from the response-function distributions.

• Formulas for evaluating counterfactual distributions when the domain is
modelled by linear structural equations.

• Derivation of strict bounds on average treatment effects from experimental
studies involving partial compliance.

1.6  Overview 

Part II of this dissertation is concerned with the theoretical and computational
aspects related to the evaluation of counterfactual probabilities. In this first
chapter the study of counterfactuals has been motivated and the
intervention-based interpretation of counterfactuals (adopted for this research) has been
introduced. In Chapter 2, a formal representation of knowledge facilitating the
analysis of counterfactual probabilities will be described, and an algorithm for
computing these probabilities will be developed. Counterfactual probabilities
may only be uniquely identified when the background knowledge is described by
a functional model. In addition, formulas are derived for evaluating
counterfactual distributions when background knowledge is given by structural equation
models with normally distributed disturbances. In Chapter 3, we demonstrate
how bounds on counterfactual probabilities may be computed/derived, when the
general knowledge of the world is described by a causal structure and conditional
probabilities over the observable variables. Chapter 4 discusses the evaluation of
counterfactual conditionals when beliefs are represented by order-of-magnitude
abstractions of probabilities.

In Part III, the evaluation of counterfactual probabilities will be demonstrated
in a set of applications. The most appealing of these applications is the
evaluation of treatment effects in studies where subjects are randomly assigned
treatment, but do not necessarily comply with this assignment. This task has been
studied by [EF91] with a concentration on continuous values of treatment
consumed, and [Man90] has derived nonparametric bounds on treatment effects for
generalized treatment domains. In Chapter 5, we derive the tightest-possible
assumption-free bounds on treatment effects from partial compliance studies,
improving upon the results of Manski. In Chapter 6, we extend these results
to the case where the domain of treatment values is continuous, and show how
these bounds may be further tightened when the continuous domain is
partitioned into ranges of homogeneous treatment responses. In Chapter 7 we discuss
the potential application of counterfactual probabilities in legal cases, and we
present a hypothetical case where proper treatment of counterfactual
probabilities is important for correctly determining liability in product-safety litigation.
In Chapter 8 we demonstrate the application of counterfactual reasoning to
economic policy-making when knowledge is given by structural equation models.

Part II


Computation

CHAPTER 2


Counterfactual probabilities

2.1  Introduction 

This  chapter  will  show  that  causal  theories  specified  in  functional  form  (as  in 
[PV91,  DS93,  Poo93])  are  sufficient  for  evaluating  counterfactual  queries,  whereas 
the  causal  information  embedded  in  Bayesian  networks  is  not  sufficient  for  the 
task.  Every  Bayes  network  can  be  represented  by  several  functional  specifications, 
each yielding different evaluations of a counterfactual. The problem is that
deciding what factual information deserves undoing (by the antecedent of the
query)  requires  a  model  of  temporal  persistence,  and,  as  noted  in  [Pea93d],  such  a 
model  is  not  part  of  static  Bayesian  networks.  A  functional  specification,  however, 
implicitly  contains  the  necessary  temporal  persistence  information. 

The next section will introduce some notation for concisely expressing
counterfactual probabilities. Section 2.3 will describe the relationship between
probabilistic and functional specifications, and will demonstrate that probabilistic
specifications do not provide sufficient information for precisely evaluating
counterfactual probabilities. Section 2.4 will provide an algorithm for evaluating
counterfactual probabilities, given a functional model of the system under query. The
algorithm will then be applied to the Firing-Squad example introduced in the
previous chapter. In Section 2.8 we will describe how counterfactual conditionals
may be analyzed when functional assumptions (e.g., linear-normal models) are
imposed on a model.

It  is  assumed  that  the  reader  is  already  familiar  with  probabilistic  causal 
networks:  representation  and  inference  techniques.  If  not,  the  reader  is  referred 
to  [Pea88]. 

2.2  Notation 

Let the set of variables describing the world be designated by
$X = \{X_1, X_2, \ldots, X_n\}$. As part of the complete specification of a counterfactual
query, there are real-world observations that make up the background context.
These observed values will be represented in the standard form $x_1, x_2, \ldots, x_n$. In
addition, we must represent the value of the variables in the counterfactual world.
To distinguish between $x_i$ and the value of $X_i$ in the counterfactual world, we
will denote the latter with an asterisk; thus, the value of $X_i$ in the
counterfactual world will be represented by $x_i^*$. We will also need a notation to distinguish
between events that might be true in the counterfactual world and those
referenced explicitly in the counterfactual antecedent. The latter are interpreted as
being forced to the counterfactual value by an external intervention, which will
be denoted by a hat (e.g., $\hat{x}$).

Thus, a typical counterfactual query will have the form “What is $P(c^* \mid \hat{a}^*, b)$?”
to be read as “Given that we have observed $B = b$ in the real world, if $A$ were $a$,
then what is the probability that $C$ would have been $c$?”

2.3  Probabilistic  vs.  functional  specification 

In  this  section  we  will  demonstrate  that  functionally  modeled  causal  theories 
[PV91]  are  necessary  for  uniquely  evaluating  counterfactual  queries,  while  the 
conditional  probabilities  used  in  the  standard  specification  of  Bayesian  networks 
are  insufficient  for  obtaining  unique  solutions. 

Reconsider the firing-squad example limited to the two variables $C$ and $B$,
representing the Captain’s signal and Bob’s firing, respectively. Assume that
previous behavior shows $P(b_1 \mid c_1) = 0.9$ and $P(b_0 \mid c_0) = 0.9$. We observe the
Captain give the release signal and Bob not fire, and then wonder with what
probability Bob would have fired if the Captain had given the order to fire, i.e.,
what is $P(b_1^* \mid \hat{c}_1^*, c_0, b_0)$? The answer depends on the mechanism that accounts for
the 10% exception in Bob’s behavior. If the reason Bob occasionally does not fire
(when the Captain signals to shoot) is that his gun has jammed and he is unable
to fire, then the answer to our query would be 8/9 (this result will be evaluated in
detail in Section 3.3). However, if the only reason for Bob’s occasional non-firing
(when the Captain signals to shoot) is that he got the signalling instructions
mixed-up, then the answer to our query is 100%, because the Captain’s release
signal and Bob’s non-firing proves that Bob has not mixed up the signals. Thus,
we see that the information contained in the conditional probabilities on the
observed variables is insufficient for answering counterfactual queries uniquely;
some information about the mechanisms responsible for these probabilities is
needed as well.

The functional specification, which provides this information, models the
influence of $C$ on $B$ by a deterministic function

$$b = F_b(c, \epsilon_b)$$

where $\epsilon_b$ stands for all unknown factors that may influence $B$ and the prior
probability distribution $P(\epsilon_b)$ quantifies the likelihood of such factors. For example,
whether Bob’s gun is jammed and whether Bob has the signals crossed could
make up two possible components of $\epsilon_b$. Given a specific value for $\epsilon_b$, $B$
becomes a deterministic function of $C$; hence, each value in $\epsilon_b$’s domain specifies a
response function that maps each value of $C$ to some value in $B$’s domain. In
general, the domain for $\epsilon_b$ could contain many components, but it can always
be replaced by an equivalent variable that is minimal, by partitioning the
domain into equivalence regions, each corresponding to a single response function
[Pea93a]. Formally, these equivalence classes can be characterized as a function
$r_b : \mathrm{dom}(\epsilon_b) \to \mathbb{N}$, as follows:


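$$r_b(\epsilon_b) = \begin{cases}
0 & \text{if } F_b(c_0, \epsilon_b) = 0 \text{ and } F_b(c_1, \epsilon_b) = 0 \\
1 & \text{if } F_b(c_0, \epsilon_b) = 0 \text{ and } F_b(c_1, \epsilon_b) = 1 \\
2 & \text{if } F_b(c_0, \epsilon_b) = 1 \text{ and } F_b(c_1, \epsilon_b) = 0 \\
3 & \text{if } F_b(c_0, \epsilon_b) = 1 \text{ and } F_b(c_1, \epsilon_b) = 1
\end{cases}$$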

Obviously, $r_b$ can be regarded as a random variable that takes on as many
values as there are functions between $C$ and $B$. This domain-minimal variable will
be referred to as a response-function variable. $r_b$ is closely related to the
potential response variables in Rubin’s model of counterfactuals [Rub74], which was
introduced to facilitate causal inference in statistical analysis [BP93].

Suppose that a variable $X$ has causal influences $\{U_1, U_2, \ldots, U_k\}$ in a
probabilistic causal model. Let the domain size of each influence $U_i$ be given by $m_i$, and
the domain size of $X$ be given by $n$. The domain size of $X$’s response-function
variable will then be of size

$$n^m \qquad (2.1)$$

where

$$m = \prod_{i=1}^{k} m_i \qquad (2.2)$$

This suggests that more than anything else, the domain sizes and fan-in of
variables will be the main contributing factors to the computational complexity of
evaluating counterfactual probabilities.
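
The count in Eqs. 2.1 and 2.2 can be checked by brute-force enumeration. The following is a minimal sketch, not part of the original analysis; the function name `response_functions` and the value labels are hypothetical and chosen only for illustration.

```python
from itertools import product

def response_functions(parent_domains, child_domain):
    """Enumerate every deterministic map from parent configurations to child values."""
    configs = list(product(*parent_domains))  # the m parent configurations (Eq. 2.2)
    functions = []
    for outputs in product(child_domain, repeat=len(configs)):
        functions.append(dict(zip(configs, outputs)))
    return functions  # n**m functions in total (Eq. 2.1)

# Bob's firing B has one binary parent C: n = 2 and m = 2, so n**m = 4.
bob = response_functions([("c0", "c1")], ("b0", "b1"))
print(len(bob))

# Hypothetical ternary variable with two binary parents: n = 3, m = 4, so 3**4 = 81.
bigger = response_functions([(0, 1), (0, 1)], (0, 1, 2))
print(len(bigger))
```

With this enumeration order, the four functions obtained for Bob line up with the indices 0 to 3 of the response-function variable defined above. The exponential growth in the number of parent configurations is what makes fan-in the dominant cost.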




For this example, the response-function variable for $B$ has a four-valued
domain $r_b \in \{0, 1, 2, 3\}$ with the following functional specification:

$$b = f_b(c, r_b) = h_{b,r_b}(c) \qquad (2.3)$$

where the mappings defined by each response function $h_{b,r_b}(c)$ are given by

$$h_{b,0}(c) = b_0 \qquad (2.4)$$

$$h_{b,1}(c) = \begin{cases} b_0 & \text{if } c = c_0 \\ b_1 & \text{if } c = c_1 \end{cases} \qquad (2.5)$$

$$h_{b,2}(c) = \begin{cases} b_1 & \text{if } c = c_0 \\ b_0 & \text{if } c = c_1 \end{cases} \qquad (2.6)$$

$$h_{b,3}(c) = b_1 \qquad (2.7)$$

The prior probability of these response functions $P(r_b)$ in conjunction with $f_b(c, r_b)$
fully parameterizes the relationship between $C$ and $B$ in the model.

For each observable variable $X_i$, there is a function that maps the value of
$X_i$’s observable causal influences $\mathrm{pa}(X_i)$ and $X_i$’s response-function variable $r_{x_i}$
to the value of $X_i$:

$$x_i = f_{x_i}(\mathrm{pa}(x_i), r_{x_i})$$

If the model is complete (such as the functional model described in [PV91]), all
response functions will be mutually independent, and each will be characterized
by a prior probability $P(r_{x_i})$. However, when some variables are left out of the
analysis, the response functions of the remaining variables $(x_1, \ldots, x_n)$ may be
dependent and, in principle, a joint probability $P(r_{x_1}, \ldots, r_{x_n})$ would be required.
In practice, only local dependencies will be needed.

If one assumes that two variables $C$ and $B$ are dependent via some exogenous
common cause, then we create an edge between $r_c$ and $r_b$ and specify the joint
distribution $P(r_c, r_b)$. This treatment of latent variables will be utilized in the
applications discussed in Sections 5.1 and 7.2.

Given $P(r_b)$, we can uniquely evaluate the counterfactual query “What is
$P(b_1^* \mid \hat{c}_1^*, c_0, b_0)$?” (i.e., “Given $C = c_0$ and $B = b_0$, if $C$ were $c_1$, then what is the
probability that $B$ would have been $b_1$?”). The intervention-based
interpretation of counterfactual antecedents implies that the disturbance $\epsilon_b$, and hence the
response-function $r_b$, is unaffected by the interventions that force the
counterfactual values; therefore, what we learn about the response-function from the
observed evidence is applicable to the evaluation of belief in the counterfactual
consequent. If we observe $(c_0, b_0)$, then we are certain that $r_b \in \{0, 1\}$, an event
having prior probability $P(r_b{=}0) + P(r_b{=}1)$. Hence, this evidence leads to an updated
posterior probability for $r_b$ (let $P(r_b) = (P(r_b{=}0), P(r_b{=}1), P(r_b{=}2), P(r_b{=}3))$)

$$P'(r_b) = P(r_b \mid c_0, b_0) = \left( \frac{P(r_b{=}0)}{P(r_b{=}0) + P(r_b{=}1)},\ \frac{P(r_b{=}1)}{P(r_b{=}0) + P(r_b{=}1)},\ 0,\ 0 \right)$$

According to Eqs. 2.3-2.7, if $C$ were forced to $c_1$, then $B$ would have been $b_1$
if and only if $r_b \in \{1, 3\}$, which has probability $P'(r_b{=}1) + P'(r_b{=}3) = P'(r_b{=}1)$.
This is exactly the solution to the counterfactual query:

$$P(b_1^* \mid \hat{c}_1^*, c_0, b_0) = P'(r_b{=}1) = \frac{P(r_b{=}1)}{P(r_b{=}0) + P(r_b{=}1)}$$

This analysis is consistent with the prior propensity account of [Sky80].
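
The update just described can be traced numerically. The sketch below is illustrative only: the table `H` encodes the response functions of Eqs. 2.4-2.7, while the function name `counterfactual` and the prior values are hypothetical, chosen so that they agree with the conditionals $P(b_1 \mid c_1) = 0.9$ and $P(b_0 \mid c_0) = 0.9$ assumed in the example.

```python
# Response functions h_{b,r}(c) from Eqs. 2.4-2.7, indexed by r in {0, 1, 2, 3}.
H = {
    0: {"c0": "b0", "c1": "b0"},
    1: {"c0": "b0", "c1": "b1"},
    2: {"c0": "b1", "c1": "b0"},
    3: {"c0": "b1", "c1": "b1"},
}

def counterfactual(prior, c_obs, b_obs, c_forced, b_query):
    """Probability that B would be b_query if C were forced to c_forced,
    given that (c_obs, b_obs) was observed in the real world."""
    # Keep only the response functions consistent with the observation.
    weights = {r: p for r, p in prior.items() if H[r][c_obs] == b_obs}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("observation has zero probability under the prior")
    posterior = {r: p / total for r, p in weights.items()}
    # Force C and read off B under each surviving response function.
    return sum(p for r, p in posterior.items() if H[r][c_forced] == b_query)

# Hypothetical prior P(r_b); any prior over {0, 1, 2, 3} could be supplied here.
prior = {0: 0.1, 1: 0.8, 2: 0.0, 3: 0.1}
answer = counterfactual(prior, "c0", "b0", "c1", "b1")
print(answer)
```

For this particular prior the result is $0.8 / (0.1 + 0.8) = 8/9$; a different prior over the response functions would generally give a different value.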


What if we are provided only with the conditional probability $P(b \mid c)$ instead
of the functional model ($f_b(c, r_b)$ and $P(r_b)$)? These two specifications are related
by:

$$P(b_1 \mid c_0) = P(r_b{=}2) + P(r_b{=}3)$$

$$P(b_1 \mid c_1) = P(r_b{=}1) + P(r_b{=}3)$$

which show that $P(r_b)$ is not, in general, uniquely determined by the conditional
distribution $P(b \mid c)$.

Hence, given a counterfactual query, a functional model always leads to a
unique solution, while a Bayesian network seldom leads to a unique solution,
depending on whether the conditional distributions of the Bayesian network
sufficiently constrain the prior distributions of the response-function variables in
the corresponding functional model. In Chapter 3 we will develop techniques for
evaluating bounds on counterfactual probabilities when only given conditional
probability distributions on the observable variables.
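
The lack of uniqueness can be seen directly in the two-variable firing-squad example. With $P(b_1 \mid c_0) = 0.1$ and $P(b_1 \mid c_1) = 0.9$, the two linear relations between $P(b \mid c)$ and $P(r_b)$, together with normalization, leave one free parameter. The sketch below sweeps that parameter on a grid; it is a crude illustration under these assumed numbers, not the optimization technique of Chapter 3, and the names `prior_family` and `query` are hypothetical.

```python
def prior_family(t, p_b1_c0=0.1, p_b1_c1=0.9):
    """Priors P(r_b) matching the given conditionals, indexed by t = P(r_b = 3)."""
    p3 = t
    p2 = p_b1_c0 - t                      # P(b1|c0) = P(r=2) + P(r=3)
    p1 = p_b1_c1 - t                      # P(b1|c1) = P(r=1) + P(r=3)
    p0 = max(0.0, 1.0 - p1 - p2 - p3)     # normalization, guarded against round-off
    return (p0, p1, p2, p3)

def query(prior):
    """P(b1* | c1-hat*, c0, b0) = P(r=1) / (P(r=0) + P(r=1))."""
    p0, p1, _, _ = prior
    return p1 / (p0 + p1)

# For these conditionals the feasible range of t is [0, 0.1].
grid = [0.1 * k / 100 for k in range(101)]
values = [query(prior_family(t)) for t in grid]
lower, upper = min(values), max(values)
print(lower, upper)
```

Under these assumptions the query ranges from 8/9 (at $t = 0.1$) to 1 (at $t = 0$), so the two answers given earlier for the jammed-gun and mixed-up-signals explanations sit at the two ends of the admissible interval.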

In practice, specifying a functional model is not as daunting as one might
think from the example above. In fact, it could be argued that the subjective
judgments needed for specifying Bayesian networks (i.e., judgments about
conditional probabilities) are generated mentally on the basis of a stored model of
functional relationships. For example, in the noisy-OR mechanism, which is often
used to model causal interactions, the conditional probabilities are derivatives of
a functional model involving AND/OR gates, corrupted by independent binary
disturbances. This model is used, in fact, to simplify the specification of
conditional probabilities in Bayesian networks [Pea88].

2.4  Evaluating counterfactual queries


From the last section, we see that the algorithm for evaluating counterfactual
queries should consist of the following steps: (1) compute the posterior
probabilities for the disturbance variables, given the observed evidence; (2) remove
the observed evidence and enforce the value for the counterfactual antecedent;
finally, (3) evaluate the probability of the counterfactual consequent, given the
conditions set in the first two steps.

An important point to remember is that it is not enough to compute the
posterior distribution of each disturbance variable ($\epsilon$) separately and treat those
variables as independent quantities. Although the disturbance variables are
initially independent, the evidence observed tends to create dependencies among the
parents of the observed variables, and these dependencies need to be represented
in the posterior distribution. An efficient way to maintain these dependencies is
through the structure of the causal network itself.
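
A small numerical illustration of this point follows. It is a sketch under assumed values: the model $X = \epsilon_1 \lor \epsilon_2$ and the priors are hypothetical and are not taken from the firing-squad example.

```python
from itertools import product

# Hypothetical model: X = e1 OR e2, with independent binary disturbances.
p_e1 = {0: 0.7, 1: 0.3}
p_e2 = {0: 0.6, 1: 0.4}

# Joint posterior over (e1, e2) after observing X = 1.
joint = {
    (a, b): p_e1[a] * p_e2[b]
    for a, b in product((0, 1), repeat=2)
    if (a or b) == 1
}
z = sum(joint.values())
joint = {k: v / z for k, v in joint.items()}

# Posterior marginals of each disturbance taken separately.
m1 = {a: sum(v for (x, _), v in joint.items() if x == a) for a in (0, 1)}
m2 = {b: sum(v for (_, y), v in joint.items() if y == b) for b in (0, 1)}

# The observation rules out (0, 0), but the product of marginals does not.
true_joint = joint.get((0, 0), 0.0)
product_of_marginals = m1[0] * m2[0]
print(true_joint, product_of_marginals)
```

Treating the two posteriors as independent would assign positive probability to a disturbance configuration that the evidence has excluded, which is why the joint posterior has to be carried along.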

Thus, we will represent the variables in the counterfactual world as distinct
from the corresponding variables in the real world, by using a separate network for
each world. Evidence can then be instantiated on the real-world network, and
the solution to the counterfactual query can be determined as the probability
of the counterfactual consequent, as computed in the counterfactual network
where the counterfactual antecedent is enforced. But, the reader may ask, and
this is key, how are the networks for the real and counterfactual worlds linked?
Because any exogenous variable, $\epsilon_a$, is not influenced by forcing the value of
any endogenous variables in the model, the value of that disturbance will be
identical in both the real and counterfactual worlds; therefore, a single variable
can represent the disturbance in both worlds. $\epsilon_a$ thus becomes a common causal
influence of the variables representing $A$ in the real and counterfactual networks,
respectively, which allows evidence in the real-world network to propagate to the
counterfactual network.

Assume that we are given a causal theory $T = \langle D, \Theta_D \rangle$ as defined in [PV91].
$D$ is a directed acyclic graph (DAG) that specifies the structure of causal
influences over a set of variables $X = \{X_1, X_2, \ldots, X_n\}$. $\Theta_D$ specifies a functional
mapping $x_i = f_i(\mathrm{pa}(x_i), \epsilon_i)$ ($\mathrm{pa}(x_i)$ represents the value of $X_i$’s parents) and a
prior probability distribution $P(\epsilon_i)$ for each disturbance $\epsilon_i$ (we assume that $\epsilon_i$’s
domain is discrete; if not, we can always transform it to a discrete domain such as
a response-function variable). A counterfactual query “What is $P(c^* \mid \hat{a}^*, o)$?” is
then posed, where $c^*$ specifies counterfactual values for a set of variables $C \subseteq X$,
$\hat{a}^*$ specifies forced values for the set of variables in the counterfactual antecedent,
and $o$ specifies observed evidence. The solution can be evaluated by the following
algorithm:


1.  From the known causal theory $T$ create a Bayesian network $\langle G, \mathcal{P} \rangle$
that explicitly models the disturbances as variables and distinguishes the
real world variables from their counterparts in the counterfactual world.
$G$ is a DAG defined over the set of variables $V = X \cup X^* \cup \epsilon$, where
$X = \{X_1, X_2, \ldots, X_n\}$ is the original set of variables modeled by $T$, $X^* =
\{X_1^*, \ldots, X_n^*\}$ is their counterfactual world representation, and $\epsilon =
\{\epsilon_1, \epsilon_2, \ldots, \epsilon_n\}$ represents the set of disturbance variables that summarize
the common external causal influences acting on the members of $X$ and
$X^*$. $\mathcal{P}$ is the set of conditional probability distributions $P(V_i \mid \mathrm{pa}(V_i))$ that
parameterizes the causal structure $G$.

If $X_j \in \mathrm{pa}(X_i)$ in $D$, then $X_j \in \mathrm{pa}(X_i)$ and $X_j^* \in \mathrm{pa}(X_i^*)$ in $G$ ($\mathrm{pa}(X_i)$
is the set of $X_i$’s parents). In addition, $\epsilon_i \in \mathrm{pa}(X_i)$ and $\epsilon_i \in \mathrm{pa}(X_i^*)$ in
$G$. The conditional probability distributions for the Bayesian network are
generated from the causal theory:

$$P(x_i \mid \mathrm{pa}_X(x_i), \epsilon_i) = \begin{cases} 1 & \text{if } x_i = f_i(\mathrm{pa}_X(x_i), \epsilon_i) \\ 0 & \text{otherwise} \end{cases}$$

where $\mathrm{pa}_X(x_i)$ is the set of values of the variables in $X \cap \mathrm{pa}(X_i)$.

$$P(x_i^* \mid \mathrm{pa}_{X^*}(x_i^*), \epsilon_i) = P(x_i \mid \mathrm{pa}_X(x_i), \epsilon_i)$$

whenever $x_i = x_i^*$ and $\mathrm{pa}_{X^*}(x_i^*) = \mathrm{pa}_X(x_i)$. $P(\epsilon_i)$ is the same as specified
by the functional causal theory $T$.


2.  Observed  evidence.  The  observed  evidence  o  is  instantiated  on  the  real 
world  variables  X  corresponding  to  o. 

3.  Counterfactual  antecedent.  For  every  forced  value  in  the  counterfactual 
antecedent  specification  x*  G  a*,  apply  the  intervention-based  semantics 
of  set{X*  =  X*)  (see  [Pea93a,  SGS93]),  which  amounts  to  severing  all  the 
causal  edges  from  pa(A:*)  to  X*  for  all  x*  G  d*  and  instantiating  X*  to  the 
value  specified  in  a*. 


4.  Belief  propagation.  After  instantiating  the  observations  and  interventionss 
in  the  network,  evaluate  the  belief  in  c*  using  the  standard  belief  update 
methods  for  Bayesian  networks  [Pea88].  The  result  is  the  solution  to  the 
counterfactual  query. 
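
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the report's implementation: it replaces belief propagation (step 4) with brute-force enumeration over the disturbance values, which yields the same answer on small discrete models, and the function name `counterfactual`, the two-variable model, and all of its numbers are hypothetical.

```python
from itertools import product

def counterfactual(order, funcs, priors, observed, forced, query):
    """Return P(query* | forced*, observed) for a functional causal model.

    order    : variable names in causal (topological) order
    funcs    : funcs[v](values, eps) -> value of v given earlier values
    priors   : priors[v] = {disturbance value: probability}
    observed : factual-world evidence, {variable: value}
    forced   : counterfactual antecedent, {variable: forced value}
    query    : counterfactual consequent, {variable: value}
    """
    num = den = 0.0
    for combo in product(*(priors[v].items() for v in order)):
        eps = {v: e for v, (e, _) in zip(order, combo)}
        weight = 1.0
        for _, p in combo:
            weight *= p
        # Steps 1-2: evaluate the factual world and keep only the
        # disturbance settings that reproduce the observations.
        real = {}
        for v in order:
            real[v] = funcs[v](real, eps[v])
        if any(real[v] != x for v, x in observed.items()):
            continue
        # Step 3: counterfactual world with the same disturbances;
        # forced variables ignore their parents.
        cf = {}
        for v in order:
            cf[v] = forced[v] if v in forced else funcs[v](cf, eps[v])
        # Step 4 (by enumeration): accumulate posterior mass.
        den += weight
        if all(cf[v] == x for v, x in query.items()):
            num += weight
    return num / den

# Hypothetical model A -> B: B copies A unless its disturbance flips it.
order = ["A", "B"]
funcs = {
    "A": lambda vals, e: e,
    "B": lambda vals, e: vals["A"] ^ e,
}
priors = {"A": {0: 0.7, 1: 0.3}, "B": {0: 0.9, 1: 0.1}}

# "Given that B = 0 was observed, would B have been 1 had A been set to 1?"
answer = counterfactual(order, funcs, priors,
                        observed={"B": 0}, forced={"A": 1}, query={"B": 1})
```

Sharing one disturbance assignment between the factual and counterfactual evaluations is what ties the two worlds together; enumeration is only practical when the disturbance domains are small.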

Note that not all evidence has to come in the form of concrete observations;
evidence can also be given as likelihood information. For example, we
may receive a report from one of the other guards that he saw a pardon on the
Warden's desk, but he did not know which Traitor it was for. We may quantify
this evidence ($e$ = guard seeing pardon) by a likelihood vector that indicates
the relative chance that the evidence came in given the Captain's signal, i.e.,
$\lambda_C = (P(e \mid c_0), P(e \mid c_1))$. [Pea88] describes how such evidence is used to update
our beliefs in a Bayesian network. Additional notation is needed to add this
virtual evidence into the specification of a counterfactual probability.
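
As a small illustration of how a likelihood vector enters the computation, the sketch below applies the standard single-variable update: the posterior is proportional to the prior times the likelihood. The function name and all numbers are hypothetical; the report does not give values for the guard's report.

```python
def apply_virtual_evidence(prior, likelihood):
    """Posterior over one variable given a likelihood vector P(e | value)."""
    unnormalized = {v: prior[v] * likelihood[v] for v in prior}
    total = sum(unnormalized.values())
    return {v: p / total for v, p in unnormalized.items()}

prior_c = {"c0": 0.4, "c1": 0.6}   # hypothetical prior on the Captain's signal
lam = {"c0": 0.8, "c1": 0.2}       # hypothetical P(e | c0), P(e | c1)
posterior_c = apply_virtual_evidence(prior_c, lam)
```

Only the ratio between the entries of the likelihood vector matters, since any common factor cancels in the normalization.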

In the last section, we noted that the conditional distribution $P(x_k \mid \mathrm{pa}(X_k))$
for each variable $X_k \in X$ constrains, but does not uniquely determine, the prior
distribution $P(\epsilon_k)$ of each disturbance variable. Although the composition of the
external causal influences is often not precisely known, a subjective distribution
over response functions may be assessable. If a reasonable distribution can be
selected for each relevant disturbance variable, the implementation of the above
algorithm is straightforward and the solution is unique; otherwise, bounds on
the solution can be obtained using convex optimization techniques. In the next
chapter, we will explain how such optimization tasks are formulated, and Chapter 5
applies this technique to derive bounds on causal effects from partially
controlled experiments.


2.5  Firing  Squad  Revisited 

Let us revisit the firing squad example. Assuming we have observed that Bob
fired his rifle ($b = b_1$), we want to know with what probability the Traitor would
have lived if Bob had not fired his rifle (i.e., “What is $P(t_1^* \mid b_0^*, b_1)$?”).

Suppose that we are supplied with the following causal theory for the model
in Figure 1.1:

$$
\begin{aligned}
c &= f_c(r_c) = h_{c,r_c}() \\
b &= f_b(c, r_b) = h_{b,r_b}(c) \\
t &= f_t(b, c, r_t) = h_{t,r_t}(b, c)
\end{aligned}
$$

where

$$
P(r_c) =
\begin{cases}
0.40 & \text{if } r_c = 0 \\
0.60 & \text{if } r_c = 1
\end{cases}
\qquad
P(r_b) =
\begin{cases}
0.02 & \text{if } r_b = 0 \\
0.90 & \text{if } r_b = 1 \\
0.08 & \text{if } r_b = 2 \\
0 & \text{if } r_b = 3
\end{cases}
\qquad
P(r_t) =
\begin{cases}
0.01 & \text{if } r_t = 0 \\
0.40 & \text{if } r_t = 8 \\
0.09 & \text{if } r_t = 10 \\
0.35 & \text{if } r_t = 12 \\
0.13 & \text{if } r_t = 14 \\
0.02 & \text{if } r_t = 15 \\
0 & \text{otherwise}
\end{cases}
$$

and

$$
h_{c,0}() = c_0, \qquad h_{c,1}() = c_1
$$

$$
\begin{aligned}
h_{t,0}(b,c) &= t_0 \\
h_{t,1}(b,c) &= \begin{cases} t_0 & \text{if } (b,c) \neq (b_1,c_1) \\ t_1 & \text{if } (b,c) = (b_1,c_1) \end{cases} \\
h_{t,2}(b,c) &= \begin{cases} t_0 & \text{if } (b,c) \neq (b_0,c_1) \\ t_1 & \text{if } (b,c) = (b_0,c_1) \end{cases} \\
h_{t,3}(b,c) &= \begin{cases} t_0 & \text{if } c = c_0 \\ t_1 & \text{if } c = c_1 \end{cases} \\
h_{t,4}(b,c) &= \begin{cases} t_0 & \text{if } (b,c) \neq (b_1,c_0) \\ t_1 & \text{if } (b,c) = (b_1,c_0) \end{cases} \\
h_{t,5}(b,c) &= \begin{cases} t_0 & \text{if } b = b_0 \\ t_1 & \text{if } b = b_1 \end{cases} \\
h_{t,6}(b,c) &= \begin{cases} t_0 & \text{if } (b,c) \in \{(b_0,c_0),(b_1,c_1)\} \\ t_1 & \text{if } (b,c) \in \{(b_1,c_0),(b_0,c_1)\} \end{cases} \\
h_{t,7}(b,c) &= \begin{cases} t_0 & \text{if } (b,c) = (b_0,c_0) \\ t_1 & \text{if } (b,c) \neq (b_0,c_0) \end{cases} \\
h_{t,8}(b,c) &= \begin{cases} t_0 & \text{if } (b,c) \neq (b_0,c_0) \\ t_1 & \text{if } (b,c) = (b_0,c_0) \end{cases} \\
h_{t,9}(b,c) &= \begin{cases} t_0 & \text{if } (b,c) \in \{(b_1,c_0),(b_0,c_1)\} \\ t_1 & \text{if } (b,c) \in \{(b_0,c_0),(b_1,c_1)\} \end{cases} \\
h_{t,10}(b,c) &= \begin{cases} t_0 & \text{if } b = b_1 \\ t_1 & \text{if } b = b_0 \end{cases} \\
h_{t,11}(b,c) &= \begin{cases} t_0 & \text{if } (b,c) = (b_1,c_0) \\ t_1 & \text{if } (b,c) \neq (b_1,c_0) \end{cases} \\
h_{t,12}(b,c) &= \begin{cases} t_0 & \text{if } c = c_1 \\ t_1 & \text{if } c = c_0 \end{cases} \\
h_{t,13}(b,c) &= \begin{cases} t_0 & \text{if } (b,c) = (b_0,c_1) \\ t_1 & \text{if } (b,c) \neq (b_0,c_1) \end{cases} \\
h_{t,14}(b,c) &= \begin{cases} t_0 & \text{if } (b,c) = (b_1,c_1) \\ t_1 & \text{if } (b,c) \neq (b_1,c_1) \end{cases} \\
h_{t,15}(b,c) &= t_1
\end{aligned}
$$

The response functions for $B$ ($h_{b,r_b}$) take the same form as that given in Eq. (2.7).

These numbers reflect the authors' understanding of the dynamics involved.
For example, the choice for $P(r_b)$ represents our belief that Bob usually fires if
and only if the Captain gives the signal to fire. However, we believe that Bob is
sometimes (~2% of the time) unable to fire (e.g., his gun jams); this exception
is represented by $r_b = 0$. In addition, Bob sometimes (~8% of the time) fires if
and only if the Captain gives the order to release the Traitor (e.g., Bob has his
signals crossed); this exception is represented by $r_b = 2$.

Finally, $P(r_t)$ represents our understanding that there is a slight chance (1%)
that the Traitor is “scared to death” ($r_t = 0$) and a slight chance (2%) that all
of the riflemen miss their target ($r_t = 15$). In addition, the chances that different
combinations of riflemen inflict a lethal wound are broken down as follows: 40%
of the time both Bob and the other riflemen are on their mark ($r_t = 8$); 9% of the
time only Bob is on his mark ($r_t = 10$); 35% of the time only the other riflemen
are on their mark ($r_t = 12$); and 13% of the time it takes the combined influence
of Bob and the other riflemen to inflict a lethal wound ($r_t = 14$).

Figure 2.1 shows the Bayesian network generated from step 1 of the algorithm.
After instantiating the real-world observation ($b_1$) and the intervention ($b_0^*$)
specified by the counterfactual antecedent in accordance with steps 2 and 3, the
network takes on the configuration shown in Figure 2.2.

Figure 2.1: Bayesian model for evaluating counterfactual queries in the
firing-squad example. The variables marked with * make up the counterfactual
world, while those without * make up the factual world. The $r$ variables index the
response functions.

If we propagate the evidence through this Bayesian network, we will arrive at
the solution

$$
P(t_1^* \mid b_0^*, b_1) = 0.15
$$

which is consistent with Dave's assertion that the Traitor would still have died
had Bob not fired, given that Bob had actually fired. Compare this with the
solution to Scott's indicative counterfactual query:

$$
P(t_1 \mid b_0) = 0.88;
$$

that is, if we had observed that Bob did not fire, the Traitor probably would
not have died. This emphasizes the difference between the intervention-based
interpretation and a revisionist interpretation of counterfactual conditionals.


Figure 2.2: To evaluate the query $P(t_1^* \mid b_0^*, b_1)$, the network of Figure 2.1 is
instantiated with observation $b_1$ and intervention $b_0^*$ (links pointing to $b_0^*$ are severed).


2.6  Complexity  Issues

The complexity of belief update in a probabilistic network depends on the
structure of the network and on the variables for which evidence is available.
If the structure of causal knowledge is given by a directed tree, then belief update
occurs in parallel at each node, requiring only a polynomial number of calculations
in terms of the variables' domain sizes. If the network is not a directed tree, but
there are no loops in the network (i.e., it is a polytree), then the complexity becomes
exponential in terms of the number of parents of each variable [Pea88, p. 183]. For
unrestricted directed acyclic graphs, the greatest increase in complexity comes
about from cycles in the structure induced by observations on child variables.
For example, if the causal structure is given by $A \to B$, $A \to C$, $B \to D$,
and $C \to D$, then a cycle would be induced on the network by observation of
$D$. [Pea88] discusses various ways of updating beliefs when induced cycles are
present in the network.

These  same  results  apply  to  the  computation  of  counterfactual  probabilities; 
however,  given  a  probabilistic  causal  model  for  a  system,  computing  a  counter- 
factual  probability  is  much  more  expensive  than  computing  a  similar  conditional 
probability,  because  response-function  variables  must  be  specified  whose  domains 
grow  in  size  according  to  Eqs.  (2.1)  and  (2.2). 

A  network  generated  by  the  algorithm  in  Section  2.4  may  often  be  simplified, 
because  not  all  response-function  variables  need  to  be  generated,  as  dictated  by 
the  following  theorem: 

Theorem 2.6.1 A response-function variable $r_x$ for the variable $X$ is necessary
for evaluating a counterfactual probability $P(c^* \mid a^*, o)$ if and only if

• $X$ is a descendant of any of the variables specified in the counterfactual
antecedent $a^*$,

• evidence is available for either $X$ or one of its descendants in the factual
world, and

• the relationship between $X$ and its known causal influences is nondeterministic.

Proof: 

If a variable $X_j^*$ in the counterfactual world is not a causal descendant of any
of the variables mentioned in the counterfactual antecedent $a^*$, then $X_j$ and
$X_j^*$ will always have identical distributions, because the causal influences
that functionally determine $X_j$ and $X_j^*$ are identical. $X_j$ and $X_j^*$ may
therefore be treated as the same variable. In this case, the conditional
distribution $P(x_j \mid \mathrm{pa}(x_j))$ is sufficient, and the disturbance variable $\epsilon_j$ and
its prior distribution need not be specified.

If $X_j$ is a causal descendant of one of the variables in the counterfactual
antecedent, but neither $X_j$ nor $X_j$'s descendants have been observed in the
real world, then the observations in the real world provide no information
about the distribution of $X_j$'s response functions $r_{x_j}$. Therefore, all we
need to know is the conditional probability distribution $P(x_j \mid \mathrm{pa}(x_j))$ for
evaluating the counterfactual probability.

If $X$ is already a deterministic function of its causal influences, then a
response-function variable becomes redundant, because one of the response
functions will have probability one, and no observed evidence will change
this prior distribution. Thus, the functional mapping remains the same in
both the real and counterfactual worlds.

However, if all three conditions above are true, then a response-function
variable is necessary because (1) the evidence on $X$ and/or propagated
from its descendants produces a posterior distribution on $r_x$ that, in
essence, changes the conditional distribution $P(x_j \mid \mathrm{pa}(x_j))$ in the counterfactual
world; and (2) the causal influences on $X$ (besides $r_x$) have different
values (or a different distribution of values) between the real and counterfactual
worlds. This means that we must know how the mapping from one
valuation of causal influences to a child value is related to the mapping from
another valuation of the causal influences; this relationship is provided by
the distribution on the response-function variable.

□

Figure 2.3: Hypothetical model with full functional specification (all
response-function variables).

Important in this discussion is that, for evaluating a particular counterfactual
probability, a specification of response functions and their prior distribution is
necessary only for a subset of the variables in the probabilistic causal model.
Consider a causal model over the variables $\{U, W, X, Y, Z\}$ with the structure
shown in Figure 2.3. This model is parameterized by the conditional probability
distributions $P(x)$, $P(u \mid x)$, $P(y \mid x, u)$, $P(w \mid x, y)$, and $P(z \mid y)$ and consists
of

$$
(M_x - 1) + M_x (M_u - 1) + M_x M_u (M_y - 1) + M_x M_y (M_w - 1) + M_y (M_z - 1)
$$

independent parameters, where $M_x$ is the variable $X$'s domain size.

In order to evaluate counterfactual probabilities for this model in general, we
would generate the combined functional model for the factual and counterfactual
worlds. The structure for this functional model is shown in Figure 2.4. This
structure is parameterized by the prior probability distributions on the
response-function variables, which consist of

$$
(M_x - 1) + (M_u^{M_x} - 1) + (M_y^{M_x M_u} - 1) + (M_w^{M_x M_y} - 1) + (M_z^{M_y} - 1)
$$

independent parameters. It would be desirable to avoid having to specify all
parameters associated with the response-function distributions. Fortunately, not
all response functions are relevant to the evaluation of specific counterfactual
probabilities. For example, suppose that we need to evaluate $P(x^*, y^*, z^* \mid u^*, w)$.

Figure 2.4: Hypothetical model with full functional specification (all
response-function variables).

Applying Theorem 2.6.1, we can eliminate from consideration the response-function
variables for $X$, $U$, and $Z$. $r_x$ is eliminated because $X$'s causal influences
in the factual and counterfactual worlds will always be the same; therefore, we
may treat $X$ and $X^*$ as one and the same variable and only need to specify $P(x)$.
$r_u$ is eliminated because $U^*$ is subjected to a
local change regardless of the value of the response function; therefore, no information
is conveyed from $U$ to $U^*$, and we only need to specify $P(u \mid x)$. $r_z$ is eliminated
because the observations in the factual world do not propagate to $r_z$; therefore,
the posterior probability on $r_z$ is identical to its prior probability. Hence, $P(z \mid y)$ is
sufficient for parameterizing the relation between $Y^*$ and $Z^*$ in the counterfactual
world. In fact, the variable $Z$ in the real world is irrelevant to the evaluation of
the counterfactual probability and may itself be eliminated. If we perform these
simplifications in the model, we are left with the structure shown in Figure 2.5.
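
The three conditions of Theorem 2.6.1 are purely graphical apart from the determinism test, so they can be checked mechanically. The following sketch is one possible reading, with assumptions stated plainly: the helper names are hypothetical, "descendant" is taken to mean proper descendant, variables named in the antecedent are excluded (they are forced), and the edge list is inferred from the conditional distributions $P(x)$, $P(u \mid x)$, $P(y \mid x, u)$, $P(w \mid x, y)$, $P(z \mid y)$ quoted above.

```python
def descendants(children, node):
    """All proper descendants of `node` in a DAG given as {parent: [children]}."""
    seen, stack = set(), list(children.get(node, []))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(children.get(n, []))
    return seen

def needed_response_variables(children, antecedent, observed, deterministic=()):
    """Variables whose response-function variable must be kept (Theorem 2.6.1)."""
    nodes = set(children) | {c for cs in children.values() for c in cs}
    # Condition 1: descendants of some antecedent variable (forced ones excluded).
    affected = set()
    for a in antecedent:
        affected |= descendants(children, a)
    affected -= set(antecedent)
    needed = set()
    for x in nodes:
        # Condition 2: evidence on x or on one of its factual-world descendants.
        has_evidence = x in observed or bool(descendants(children, x) & set(observed))
        # Condition 3: x is not a deterministic function of its parents.
        if x in affected and has_evidence and x not in deterministic:
            needed.add(x)
    return needed

# Edges implied by P(x), P(u|x), P(y|x,u), P(w|x,y), P(z|y).
children = {"X": ["U", "Y", "W"], "U": ["Y"], "Y": ["W", "Z"], "W": [], "Z": []}
needed = needed_response_variables(children, antecedent={"U"}, observed={"W"})
```

For the query with antecedent on $U$ and evidence on $W$, only the response-function variables of $Y$ and $W$ survive, which agrees with the elimination of $r_x$, $r_u$, and $r_z$ described in the text.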

Figure 2.5: Simplified functional specification for a given observation on $W$ and
counterfactual antecedent specifying just $U$. Note that the response functions $r_x$,
$r_u$, and $r_z$ do not require specification.

2.7  Statistical independence and counterfactual probabilities

Statistical independence, e.g., $P(b_1 \mid a_0) = P(b_1 \mid a_1)$, does not give us permission
to remove a causal edge from a probabilistic specification. If there is in fact a
direct causal influence from $A$ to $B$, then a functional specification for the model
can lead to divergent values for the counterfactual probability $P(b_1^* \mid a_1^*, a_0, b_0)$.
For example, suppose that

$$
P(b_1 \mid a_0) = 0.50, \qquad P(b_1 \mid a_1) = 0.50
$$

We can imagine two distributions over $B$'s response functions consistent with the
conditional probability distribution $P(b \mid a)$:

$$
P_1(r_b = 0) = 0.5, \quad P_1(r_b = 1) = 0.0, \quad P_1(r_b = 2) = 0.0, \quad P_1(r_b = 3) = 0.5
$$

and

$$
P_2(r_b = 0) = 0.0, \quad P_2(r_b = 1) = 0.5, \quad P_2(r_b = 2) = 0.5, \quad P_2(r_b = 3) = 0.0
$$

The first distribution, $P_1$, shows $B$ independent of $A$ and evaluates
$P(b_1^* \mid a_1^*, a_0, b_0) = 0.0$, while the second distribution, $P_2$, shows $B$
deterministically dependent on $A$ and evaluates $P(b_1^* \mid a_1^*, a_0, b_0) = 1.0$. Thus, it is
crucial that the dependencies in the causal model be determined not only by
statistical considerations but also by subjective knowledge of causal effects.
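
The divergence can be checked directly. The sketch below is illustrative only: it assumes that $B$'s four response functions are indexed as "always $b_0$" (0), "copy $A$" (1), "negate $A$" (2), and "always $b_1$" (3), which matches the description of $r_b$ in Section 2.5, and the function names are hypothetical. Both distributions reproduce the same conditional probabilities yet give opposite counterfactual answers.

```python
# Assumed indexing of B's response functions: (value of B if A=0, value of B if A=1).
H_B = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}

def conditional(p_r, a):
    """P(b1 | a) implied by a distribution over response functions."""
    return sum(p for r, p in p_r.items() if H_B[r][a] == 1)

def counterfactual_b1(p_r, a_obs, b_obs, a_forced):
    """P(b1* | a_forced*, a_obs, b_obs): abduce r_b, then re-evaluate B."""
    consistent = {r: p for r, p in p_r.items() if H_B[r][a_obs] == b_obs}
    total = sum(consistent.values())
    return sum(p for r, p in consistent.items() if H_B[r][a_forced] == 1) / total

P1 = {0: 0.5, 1: 0.0, 2: 0.0, 3: 0.5}
P2 = {0: 0.0, 1: 0.5, 2: 0.5, 3: 0.0}

for p_r in (P1, P2):
    print(conditional(p_r, 0), conditional(p_r, 1),
          counterfactual_b1(p_r, a_obs=0, b_obs=0, a_forced=1))
```

The prior on $A$ does not appear because $A$ is observed in the factual world and forced in the counterfactual one.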


2.8  Parametric  and  canonical  models 

The advantage of using specialized models, e.g., parametric or canonical, arises
from the reduction in the number of parameters necessary for completely specifying
the model. The exponential size of response-function variable domains
provides a strong incentive to develop and apply these canonical representations.
The reduction in parameters can possibly lead to more efficient computations, but
will never lead to less efficiency, because the complete response-function model
may always be generated from the reduced-parameter canonical model.

Figure 2.6: Unconstrained model of two known variables influencing a third.

2.8.1  Canonical  models 

The typical method for reducing the number of parameters in probabilistic causal
networks is to decompose the relationship between an effect and its set of causes
into an expanded model with additional variables that impose structural
independencies. For example, suppose that two binary variables $A$ and $B$ causally
influence another binary variable $E$ as depicted in Figure 2.6. In an unrestricted
model, the conditional probability distributions $P(E \mid a, b)$, $a \in \{a_0, a_1\}$
and $b \in \{b_0, b_1\}$, require the specification of four independent parameters, and the
distribution for the response-function variable $r_e$, $P(r_e)$, requires the specification
of 15 independent parameters.

Suppose, however, that the interaction between these variables is correctly
modelled by a Noisy-OR gate. This model imposes additional structural assumptions
on the causal network, as depicted in Figure 2.7. Two new intermediate
variables, $I_a$ and $I_b$, have been introduced into the network; $E$ is functionally
determined by $I_a$ and $I_b$ ($E = I_a \lor I_b$), and the nondeterministic effects of $A$ on
$I_a$, and $B$ on $I_b$, are specified by the conditional probability distributions $P(I_a \mid A)$
and $P(I_b \mid B)$. This Noisy-OR model still requires the specification of four
independent parameters in the general case, but the savings become apparent when
attempting to evaluate counterfactual probabilities. From Theorem 2.6.1, no
response-function variable need be generated for $E$, because $E$ is a deterministic
function of its causal parents. Instead, response-function variables are generated
for $I_a$ and $I_b$, and the distributions of the response functions, $P(r_{I_a})$ and $P(r_{I_b})$, are
specified; however, this requires specification of only six independent parameters.

Figure 2.7: Functional model assuming that the influence of $A$ and $B$ on $E$ may
be modelled by a Noisy-OR gate.

The reduction in parameters becomes even more drastic as the number of
causal influences impinging on a variable increases. Consider the case where $n$
binary variables $C_1, C_2, \ldots, C_n$ influence another binary variable $E$. The
conditional probability distributions $P(e \mid c_1, c_2, \ldots, c_n)$ are completely specified by
$2^n$ independent parameters (one for each configuration of the causes, in agreement
with the four parameters of the two-cause case above). The general functional model for this pattern of
influence is depicted in Figure 2.8, where the distribution of response functions
$P(r_e)$ requires specification of $2^{2^n} - 1$ independent parameters. This
super-exponential growth of parameters as a function of the number of causal influences
makes the task of counterfactual inference unmanageable. It quickly becomes
apparent that the number of specification parameters must be reduced in order to
make any headway.

Figure 2.8: Unconstrained model with $n$ known variables influencing the variable
$E$.

Suppose we assume that the relationship between a variable and its causal
influences satisfies the temporal definition of causal independence [Hec93]. In this
case, the causal structure of Figure 2.8 may be expanded to the structure depicted
in Figure 2.9. Each response-function variable $r_{e_1}, \ldots, r_{e_n}$ specifies the mapping
from two binary variables to a single binary variable (requiring specification of
15 independent parameters for each $P(r_{e_i})$), while the prior distribution on $r_{e_0}$ is just
the same as the prior distribution on $e_0$. Therefore, the decomposed model of
Figure 2.9 requires the specification of only $15n + 1$ independent parameters, a
drastic reduction from the $2^{2^n} - 1$ required for the unconstrained functional
model of Figure 2.8.
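
To make the growth rates concrete, the short sketch below tabulates the two counts quoted in the text for a few values of $n$. The function names are hypothetical; the formulas are taken directly from the paragraphs above.

```python
def response_parameters(n):
    """Independent parameters in P(r_e) for the unconstrained functional model."""
    return 2 ** (2 ** n) - 1

def decomposed_parameters(n):
    """Count quoted in the text for the temporally decomposed model."""
    return 15 * n + 1

for n in (1, 2, 3, 5):
    print(n, response_parameters(n), decomposed_parameters(n))
```

By these formulas the decomposed model needs more parameters than the unconstrained one when $n$ is 1 or 2; the saving starts at $n = 3$ and grows very quickly after that.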

Of  course,  a  decomposed  model  should  only  be  applied  when  it  is  believed  that 
it  provides  a  good  approximation  of  the  relationships  existing  in  the  real  world. 
However,  from  the  comparison  of  parameter  counts,  it  is  apparent  that  the  full 
response-function  distribution  P{re)  could  be  pragmatically  unmanageable,  and 
require  the  use  of  a  decomposed  model  in  order  to  make  any  progress. 

2.8.2  Linear-Normal  Models 

Assume that knowledge is specified by the structural equation model (often used
in econometrics and the social sciences, and originally established by Sewall
Wright in his development of path analysis models [Wri21])

$$
x = Bx + \epsilon
$$

where $B$ is a matrix (not necessarily triangular) corresponding to a causal model
(possibly cyclic), and we are given the mean $\mu_\epsilon$ and covariance $\Sigma_{\epsilon,\epsilon}$ of the
disturbances $\epsilon$ (assumed to be normal). The variables on the right-hand side of
a structural equation are interpreted as the causal influences of the variable on
the left-hand side of the equation. The mean and covariance of the observable
variables $X$ are then given by:

$$
\mu_x = S \mu_\epsilon \tag{2.8}
$$

$$
\Sigma_{x,x} = S \, \Sigma_{\epsilon,\epsilon} \, S^T \tag{2.9}
$$

where $S = (I - B)^{-1}$.

Figure 2.9: Canonical model assuming temporal causal independence.

Under such a model, there are well-known formulas [Whi90, p. 163] [Dem69]
for evaluating the mean and covariance of $X$ conditioned on some observations
$o$:

$$
\mu_{x \mid o} = \mu_x + \Sigma_{x,o} \, \Sigma_{o,o}^{-1} \, (o - \mu_o) \tag{2.10}
$$

$$
\Sigma_{x,x \mid o} = \Sigma_{x,x} - \Sigma_{x,o} \, \Sigma_{o,o}^{-1} \, \Sigma_{o,x} \tag{2.11}
$$

where, for every pair of sub-vectors, $Z$ and $W$, of $X$, $\Sigma_{z,w}$ is the sub-matrix of
$\Sigma_{x,x}$ with entries corresponding to the components of $Z$ and $W$. Singularities of
$\Sigma$ terms are handled by appropriate means.

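Similar formulas apply for the mean and covariance of $X$ under an intervention
$a$. For mathematical convenience, let $X$ be partitioned according to whether each
variable is referred to in $a$. The set of variables referred to in $a$ is denoted by $Z$,
and the set of remaining variables in $X$ is denoted by $Y$. Under this partition,
the matrix $B$ can be partitioned into four submatrices

$$
B = \begin{bmatrix} B_{yy} & B_{yz} \\ B_{zy} & B_{zz} \end{bmatrix}
$$

$B$ is replaced by the intervention-pruned matrix $\tilde{B} = [\tilde{b}_{ij}]$ defined by:

$$
\tilde{b}_{ij} =
\begin{cases}
0 & \text{if } X_i \in Z \\
b_{ij} & \text{otherwise}
\end{cases}
$$

Equivalently,

$$
\tilde{B} = \begin{bmatrix} B_{yy} & B_{yz} \\ 0 & 0 \end{bmatrix}
$$

According to intervention semantics [Pea94a], all links from the parents of $Z$ to $Z$ are severed
and $Z$ is forced to the value $a_z$. Therefore, the modified structural equation model
for $X$ when influenced by external interventions is given by

$$
x = (I - \tilde{B})^{-1} \left( \begin{bmatrix} \epsilon_y \\ 0 \end{bmatrix} + \begin{bmatrix} 0 \\ a_z \end{bmatrix} \right)
$$

Given the mean and covariance of $\epsilon_y$, the mean and covariance of the observable
variables $X$ may be evaluated:

$$
\mu_{x \mid a} = \begin{bmatrix} \mu_{y \mid a} \\ \mu_{z \mid a} \end{bmatrix}
= \begin{bmatrix} (I - B_{yy})^{-1} (\mu_{\epsilon_y} + B_{yz} a_z) \\ a_z \end{bmatrix} \tag{2.12}
$$

$$
\Sigma_{x,x \mid a} = \begin{bmatrix} \Sigma_{y,y \mid a} & \Sigma_{y,z \mid a} \\ \Sigma_{z,y \mid a} & \Sigma_{z,z \mid a} \end{bmatrix}
= \begin{bmatrix} (I - B_{yy})^{-1} \, \Sigma_{\epsilon_y,\epsilon_y} \, \left((I - B_{yy})^{-1}\right)^T & 0 \\ 0 & 0 \end{bmatrix} \tag{2.13}
$$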

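To evaluate the counterfactual distribution ($\mu_{x^* \mid a^* o}$ and $\Sigma_{x^*,x^* \mid a^* o}$), we first update
the prior distribution of the disturbances by their distribution conditioned on the
observations $o$:

$$
\mu^o_{\epsilon} = \mu_{\epsilon \mid o} = \mu_\epsilon + \Sigma_{\epsilon,o} \, \Sigma_{o,o}^{-1} \, (o - \mu_o)
$$

$$
\Sigma^o_{\epsilon,\epsilon} = \Sigma_{\epsilon,\epsilon \mid o} = \Sigma_{\epsilon,\epsilon} - \Sigma_{\epsilon,o} \, \Sigma_{o,o}^{-1} \, \Sigma_{o,\epsilon}
$$

where $\Sigma_{\epsilon,o} = \Sigma_{\epsilon,\epsilon} S_o^T$, $\Sigma_{o,o} = S_o \Sigma_{\epsilon,\epsilon} S_o^T$, and
$S_o$ is the submatrix of $S$ containing all columns of $S$, but only those rows
corresponding to the observed variables in $o$.

We then evaluate the means $\mu_{x^* \mid a^* o}$ and variances $\Sigma_{x^*,x^* \mid a^* o}$ of the variables
in the counterfactual world ($X^*$) under the intervention $a^*$ using Eqs. (2.12) and
(2.13), by replacing the prior distribution on the disturbances ($\mu_{\epsilon_y}$ and $\Sigma_{\epsilon_y,\epsilon_y}$) with
the posterior distribution ($\mu^o_{\epsilon_y}$ and $\Sigma^o_{\epsilon_y,\epsilon_y}$):

$$
\mu_{x^* \mid a^* o} = \begin{bmatrix} (I - B_{yy})^{-1} (\mu^o_{\epsilon_y} + B_{yz} a_z) \\ a_z \end{bmatrix} \tag{2.14}
$$

$$
\Sigma_{x^*,x^* \mid a^* o} = \begin{bmatrix} (I - B_{yy})^{-1} \, \Sigma^o_{\epsilon_y,\epsilon_y} \, \left((I - B_{yy})^{-1}\right)^T & 0 \\ 0 & 0 \end{bmatrix} \tag{2.15}
$$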

It  is  clear  that  this  procedure  can  be  applied  to  non-triangular  matrices,  as 
long  as  S  is  non-singular.  An  application  of  this  class  of  model  representation 
will  be  presented  in  Chapter  8. 
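
A scalar instance may help make the procedure concrete. The sketch below is a hedged illustration, not the report's implementation: it assumes a hypothetical two-variable model $x_1 = \epsilon_1$, $x_2 = b\,x_1 + \epsilon_2$ with independent normal disturbances, an observation of $X_2$ only, and the intervention $set(X_1 = a)$. In this case $B_{yy} = 0$ and $B_{yz} = b$, so the disturbance update and Eqs. (2.14) and (2.15) reduce to scalar arithmetic.

```python
def counterfactual_x2(b, var_e1, var_e2, x2_obs, a, mean_e1=0.0, mean_e2=0.0):
    """Mean and variance of X2* under set(X1 = a), given the observation X2 = x2_obs.

    Hypothetical model: x1 = e1, x2 = b * x1 + e2, with independent normal e1, e2.
    """
    # Factual world: x2 = b * e1 + e2.
    mean_x2 = b * mean_e1 + mean_e2
    var_x2 = b * b * var_e1 + var_e2
    cov_e2_x2 = var_e2
    # Update the disturbance e2 on the observation (scalar Gaussian conditioning).
    gain = cov_e2_x2 / var_x2
    post_mean_e2 = mean_e2 + gain * (x2_obs - mean_x2)
    post_var_e2 = var_e2 - gain * cov_e2_x2
    # Counterfactual world: X1 is forced to a, so x2* = b * a + e2.
    return b * a + post_mean_e2, post_var_e2

mean_cf, var_cf = counterfactual_x2(b=2.0, var_e1=1.0, var_e2=1.0, x2_obs=5.0, a=0.0)
```

The observation only partly determines $\epsilon_2$, so the counterfactual value of $X_2$ keeps a nonzero variance; had $X_1$ been observed as well, the posterior on $\epsilon_2$ would be a point mass.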


2.9  Conclusion 

In this chapter we have presented formal notation, semantics, a representation
scheme, and inference algorithms that facilitate the probabilistic evaluation of
counterfactual queries. World knowledge is represented in the language of modified
causal networks, whose root nodes are unobserved and correspond to possible
functional mechanisms operating among families of observables. The prior
probabilities of these root nodes are updated by the factual information transmitted
with the query, and remain fixed thereafter. The antecedent of the query
is interpreted as a proposition that is established by an external intervention,
thus pruning the corresponding links from the network and facilitating standard
Bayesian-network computation to determine the probability of the consequent.

The algorithm has not been implemented, but, given a subjective prior distribution
over the response variables, there are no new computational tasks introduced
by this formalism, and the inference process follows the standard techniques
for computing beliefs in Bayesian networks [Pea88]. If prior distributions over the
relevant response-function variables cannot be assessed, there are methods that
use the standard conditional-probability specification of Bayesian networks to
compute upper and lower bounds on counterfactual probabilities. Chapter 3 will
formally develop these methods.

The semantics and methodology introduced in this chapter can be adapted to
nonprobabilistic formalisms as well, as long as they support two essential components:
abduction (to abduce plausible functional mechanisms from the factual
observations) and causal projection (to infer the consequences of the
intervention-like antecedent). We should note, though, that the license to keep the
response-function variables constant stems from a unique feature of counterfactual queries,
where the factual observations are presumed to occur not earlier than the
counterfactual intervention. In general, when an observation takes place before an
intervention, constancy of response functions would be justified if the environment
remains relatively static between the observation and the intervention (e.g., if the
disturbance terms $\epsilon_i$ represent unknown pre-intervention conditions). However,
in a dynamic environment subject to stochastic shocks, a full temporal analysis using
temporally-indexed networks may be warranted or, alternatively, a canonical
model of persistence should be invoked [Pea93d].

CHAPTER  3


Bounding  counterfactual  probabilities

3.1  Introduction 

In Chapter 2, an algorithm was presented for evaluating the unique quantitative
solutions to counterfactual queries when a functional model is available. However,
it is rare that there is sufficient knowledge about a system's underlying
mechanisms to generate a complete functional model. This chapter is concerned
with the evaluation of counterfactual probabilities when this model is incomplete.

Section 3.2 describes how counterfactual probabilities may be uniquely expressed
in terms of a functional model's distribution of response functions. In
Section 3.3 we will describe how these response-function distributions are constrained
by a probabilistic specification over the observable variables in the system,
and how the expression for the counterfactual probability may be minimized
and maximized over these constraints. When the expression to be optimized is
a linear function of the response-function distributions, the evaluation of bounds
on the counterfactual probability may be guaranteed; however, as will be demonstrated
in Section 3.4, many counterfactual probabilities are polynomial functions
of the response-function distributions, in which case the potential for local optima
usually means that determination of bounds is not guaranteed. Finally, in
Section 3.5 we demonstrate that marginalization of variables from a probabilistic
causal model prior to evaluating bounds on counterfactual probabilities leads to
looser bounds than if the analysis were performed on the original model.

3.2  Expressing  counterfactual  probabilities  in  terms  of 
response-function  distributions 

Given the functional specification of a causal system as described in Section 2.3,
we can derive an expression for a counterfactual probability $P(c^* \mid a^*, o)$ in terms
of the underlying functional model's parameters.

Let $r = (r_{x_1}, \ldots, r_{x_n})$ represent the set of response-function variables for
the corresponding observable variables in the model. Given the value of $r$, all
variables $X_i \in X$ are functionally determined according to the recursive function:

$$
x_i = g_{x_i}(r) = f_{x_i}\big(g_{u_1}(r), g_{u_2}(r), \ldots, g_{u_k}(r), r_{x_i}\big)
$$

where $\mathrm{pa}(X_i) = \{U_1, U_2, \ldots, U_k\} \subseteq X$ are the causal influences of $X_i$ in the
model.

If a set of variables $A \subseteq X$ in the model are externally forced to the value
$a$, then according to the intervention-based semantics of [Pea93a], the recursive
function becomes

$$
x_i = g_{x_i}(r) =
\begin{cases}
a_i & \text{if } X_i \in A \\
f_{x_i}(r_{x_i}) & \text{if } X_i \notin A \text{ and } \mathrm{pa}(X_i) = \emptyset \\
f_{x_i}\big(g_{u_1}(r), g_{u_2}(r), \ldots, g_{u_k}(r), r_{x_i}\big) & \text{otherwise}
\end{cases}
$$

where $a_i$ denotes the value that $a$ assigns to $X_i$.

The counterfactual probability $P(c^* \mid a^*, o)$ may be rewritten

$$
P(c^* \mid a^*, o) = \frac{P(c^*, o \mid a^*)}{P(o \mid a^*)}
$$

Since an intervention can only affect its descendants in the graph [Pea94b], we have
$P(o \mid a^*) = P(o)$, which is readily computed from the probabilistic specification.

$P(c^*, o \mid a^*)$ may be evaluated in terms of the functional model by summing
the probabilities of the response-function configurations that are consistent with
the arguments ($c^*$, $a^*$, $o$). Formally,

$$
P(c^*, o \mid a^*) = \sum_{r \in R} P(r)
$$

where

$$
R = \{\, r \mid \forall x_i \in o \, [g_{x_i}(r) = x_i] \text{ and } \forall x_j^* \in c^* \, [g_{x_j^*}(r) = x_j^*] \,\}
$$

with $g_{x_j^*}$ evaluated under the intervention $a^*$. Hence, the counterfactual
probability may be written in terms of the structure
$\{\mathrm{pa}(x_i)\}$ and parameters $P(r)$ of the functional model:

$$
P(c^* \mid a^*, o) = \frac{1}{P(o)} \sum_{r \in R} P(r) \tag{3.1}
$$
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The  next  section  will  describe  how  the  right-hand  side  of  Eq.(3.1)  may  be  op¬ 
timized  subject  to  the  constraints  imposed  by  the  probabilistic  specification.  But 
first,  we  will  derive  an  expression  for  the  probability  that  Bob  would  have  fired  his 
rifle,  if  the  Captain  were  to  have  given  the  order  to  shoot,  given  that  the  Captain 
gave  the  order  to  release  the  traitor  and  Bob  did  not  shoot  [cj,  cq,  bo)). 

The connection between the factual and counterfactual worlds was discussed
in Chapter 2 where it was argued that the response-function variables should
assume the same values in both worlds. For the firing-squad example, this
invariance allows the response-function variables $r_c$ and $r_b$ to be shared between the
networks corresponding to the two worlds (see Figure 3.1).


Figure 3.1: Factual $(C, B)$ and counterfactual $(C^*, B^*)$ worlds for the functional
analysis of the structure $C \rightarrow B$. The response-function variables $r_c$ and $r_b$
(summarizing all exogenous influences on $C$ and $B$) attain the same value in the
real and counterfactual worlds.

The domain of $B$'s response-function variable $r_b$ is defined by Eq. (2.3), while
the response-function variable for $C$ has a two-valued domain $r_c \in \{0, 1\}$ with
the following functional specification:

$$c = f_c(r_c) = h_{c,r_c}() \tag{3.2}$$

where the mappings defined by each response function $h_{c,r_c}()$ are given by

$$h_{c,0}() = c_0$$
$$h_{c,1}() = c_1$$

One quickly notes that the prior probability distribution on $r_c$ will be the same as
the distribution over $C$ (this is true in general for variables that were root nodes
in the original causal structure):

$$P(r_c = 0) = P(c_0)$$
$$P(r_c = 1) = P(c_1)$$

From Eq. (3.1),

$$P(b_1^* \mid c_1^*, c_0, b_0) = \frac{\sum_{r \in R} P(r)}{P(c_0, b_0)}$$

where

$$R = \{\, (r_c, r_b) \mid g_c(r_c, r_b) = c_0 \wedge g_b(r_c, r_b) = b_0 \wedge g_b^*(r_c, r_b) = b_1 \,\} \tag{3.3}$$

The only tuple satisfying the condition in Eq. (3.3) is

$$(r_c, r_b) = (0, 1)$$

Therefore,

$$P(b_1^* \mid c_1^*, c_0, b_0) = \frac{P(r_c = 0)\, P(r_b = 1)}{P(c_0, b_0)}$$

But $P(r_c = 0) = P(c = c_0)$, hence,

$$P(b_1^* \mid c_1^*, c_0, b_0) = \frac{P(r_b = 1)}{P(b_0 \mid c_0)}$$
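
The set R of Eq. (3.3) can be checked by direct enumeration. The following is a minimal sketch, not part of the original analysis; it assumes that the indexing of Bob's response functions in Eq. (2.3) agrees with the constraints (3.6)-(3.7), i.e. r_b = 0 always yields b0, r_b = 1 follows the Captain's signal, r_b = 2 opposes it, and r_b = 3 always yields b1. The names `H_B` and `consistent_configurations` are illustrative only.

```python
from itertools import product

# Assumed encoding: value index 0 stands for c0/b0 and 1 for c1/b1.
# H_B[r_b] = (value of B when C = c0, value of B when C = c1).
H_B = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}


def consistent_configurations():
    """Return every (r_c, r_b) with C = c0 and B = b0 in the factual world
    and B* = b1 in the counterfactual world where C* is forced to c1."""
    result = []
    for r_c, r_b in product((0, 1), (0, 1, 2, 3)):
        c = r_c                  # C = f_c(r_c): r_c = 0 gives c0, r_c = 1 gives c1
        b = H_B[r_b][c]          # factual value of B
        b_star = H_B[r_b][1]     # counterfactual value of B under C* = c1
        if c == 0 and b == 0 and b_star == 1:
            result.append((r_c, r_b))
    return result


R = consistent_configurations()
print(R)
```

Because r_c and r_b are shared between the two worlds, a single loop over their joint domain covers both the factual and the counterfactual requirements.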


3.3  Constraints  and  optimization 


The probabilistic specification $P(x_i \mid \mathrm{pa}(x_i))$ for a complete model imposes a set
of constraints on the distribution of response functions $P(r_{x_i})$ of the form

$$P(x_i \mid \mathrm{pa}(x_i)) = \sum_{r_{x_i}} P(r_{x_i})\, t(r_{x_i}; x_i, \mathrm{pa}(x_i)) \tag{3.4}$$

where the characteristic function $t$ indicates which response functions $r_{x_i}$ map
the given value of $X_i$'s causal influences $\mathrm{pa}(x_i)$ to $X_i$'s given value $x_i$, i.e.

$$t(r_{x_i}; x_i, \mathrm{pa}(x_i)) = \begin{cases} 1 & \text{if } x_i = f_{x_i}(\mathrm{pa}(x_i), r_{x_i}) \\ 0 & \text{otherwise} \end{cases}$$

For an incomplete model, where $X_i$ and $X_j$ are assumed to have an exogenous
common cause, the common constraint for these two variables will be given
instead by

$$P(x_i, x_j \mid \mathrm{pa}_{X - \{X_j\}}(x_i), \mathrm{pa}_{X - \{X_i\}}(x_j)) = \sum_{r_{x_i}, r_{x_j}} P(r_{x_i}, r_{x_j})\, t(r_{x_i}; x_i, \mathrm{pa}(x_i))\, t(r_{x_j}; x_j, \mathrm{pa}(x_j)) \tag{3.5}$$

Note that the constraints in Eq. (3.5) are linear in $P(r_{x_i}, r_{x_j})$.

As  an  example,  the  constraints  in  the  firing-squad  story  (which  is  complete 
with  two  binary  variables  C  and  B)  are  given  by 

$$P(b_1 \mid c_0) = P(r_b = 2) + P(r_b = 3) \tag{3.6}$$

$$P(b_1 \mid c_1) = P(r_b = 1) + P(r_b = 3) \tag{3.7}$$

$$P(c_1) = P(r_c = 1) \tag{3.8}$$


Given the entire set of linear constraints and the objective function from
Eq. (3.1), the bounds may be evaluated using techniques for optimizing non-linear
objective functions under linear constraints [Sca85]. In general, the optimization
procedure may converge to a local minimum or maximum, which would produce false
bounds. If the objective is to prove that the counterfactual probability falls within
a certain range, care must be taken to ensure that global optima are found.

If  the  objective  function  given  by  Eq.  (3.1)  is  linear,  the  minimum/maximum 
may  be  determined  using  linear  programming  techniques.  In  this  case,  when 
the  problem  size  is  small  enough,  we  may  also  derive  closed-form  bounds  to  the 
counterfactual probability in terms of the probabilistic specification. This is
accomplished by enumerating the vertices in the dual linear programming problem
(see  Appendix  B). 

For the firing-squad example, the symbolic expression for the counterfactual
probability $P(b_1^* \mid c_1^*, c_0, b_0)$ may be optimized over the space of linear constraints
given by Eqs. (3.6)-(3.8). The resulting symbolic bounds are:

$$\frac{1}{P(b_0 \mid c_0)} \max \left\{ \begin{array}{c} 0 \\ P(b_1 \mid c_1) - P(b_1 \mid c_0) \end{array} \right\} \le P(b_1^* \mid c_1^*, c_0, b_0) \le \frac{1}{P(b_0 \mid c_0)} \min \left\{ \begin{array}{c} P(b_0 \mid c_0) \\ P(b_1 \mid c_1) \end{array} \right\}$$

By substituting the known conditional probabilities from Section 2.3

$$P(b_0 \mid c_0) = 0.90$$
$$P(b_1 \mid c_1) = 0.90$$

we can evaluate the numeric bounds on the counterfactual probability:

$$8/9 \le P(b_1^* \mid c_1^*, c_0, b_0) \le 1$$

Sometimes, one may feel confident in claiming that additional constraints
should be imposed on the parameters defining the distribution of response functions.
For example, we may subjectively believe that Bob never confuses the shoot and
release signals, which is simply written $P(r_b = 2) = 0$ and added to the existing
set of constraints. The optimization of the expression for the counterfactual
probability then proceeds as before. In this case, this assumption is sufficient to
uniquely determine the counterfactual probability

$$P(b_1^* \mid c_1^*, c_0, b_0) = 8/9$$

This shows that partial knowledge or belief about the distribution of response
functions is an important means of tightening bounds on counterfactual
probabilities given only a probabilistic specification of observable variables.
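
The numeric bounds above can be reproduced with a small scan. The sketch below is an illustration rather than the dual-vertex enumeration of Appendix B: it assumes the response-function indexing implied by Eqs. (3.6)-(3.7) and exploits the fact that, once P(b1|c0) and P(b1|c1) are fixed, the distribution over r_b has a single free parameter q3 = P(r_b=3). The function name, the grid resolution `steps`, and the flag `never_confuses` are hypothetical.

```python
from fractions import Fraction


def firing_squad_bounds(p_b1_c0, p_b1_c1, steps=1000, never_confuses=False):
    """Bound P(b1*|c1*,c0,b0) = P(r_b=1) / P(b0|c0) by scanning q3 = P(r_b=3).

    The objective is linear in q3, so the extremes lie at the ends of the
    feasible interval; the grid must therefore contain those end points
    (true when the inputs are multiples of 1/steps).
    """
    p0, p1 = Fraction(p_b1_c0), Fraction(p_b1_c1)
    values = []
    for i in range(steps + 1):
        q3 = Fraction(i, steps)
        q1 = p1 - q3                 # Eq. (3.7): P(b1|c1) = P(r_b=1) + P(r_b=3)
        q2 = p0 - q3                 # Eq. (3.6): P(b1|c0) = P(r_b=2) + P(r_b=3)
        q0 = 1 - q1 - q2 - q3        # normalization
        if min(q0, q1, q2) < 0:
            continue                 # not a probability distribution
        if never_confuses and q2 != 0:
            continue                 # extra belief: P(r_b=2) = 0
        values.append(q1 / (1 - p0))
    return min(values), max(values)


bounds = firing_squad_bounds(Fraction(1, 10), Fraction(9, 10))
point = firing_squad_bounds(Fraction(1, 10), Fraction(9, 10), never_confuses=True)
print(bounds, point)
```

Exact rational arithmetic is used so that the feasibility test and the added equality constraint are not disturbed by rounding.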


3.4  Nonlinear  expressions 

Unfortunately, a closed-form expression for the counterfactual probability is not
always a linear function of the parameters of the response-function distributions.
This will be demonstrated by the following example which relates to the
firing-squad example previously discussed.

First,  the  original  model  will  be  expanded  by  incorporating  the  additional 
knowledge  that  there  is  only  one  other  rifleman,  Dave,  whose  tendency  to  fire  is 
independent  of  Bob’s  firing  given  the  Captain’s  signal.  The  causal  structure  for 
this  model  is  depicted  in  Figure  3.2.  The  story  relating  Dave’s  firing  habits  will 
be similar to Bob's habits described in Section 1.2. $D$ is a deterministic function
of $C$ and $D$'s response-function variable $r_d$:

$$d = f_d(c, r_d) = h_{d,r_d}(c) \tag{3.9}$$

where

$$h_{d,0}(c) = d_0 \tag{3.10}$$

$$h_{d,1}(c) = \begin{cases} d_0 & \text{if } c = c_0 \\ d_1 & \text{if } c = c_1 \end{cases} \tag{3.11}$$

$$h_{d,2}(c) = \begin{cases} d_1 & \text{if } c = c_0 \\ d_0 & \text{if } c = c_1 \end{cases} \tag{3.12}$$

$$h_{d,3}(c) = d_1 \tag{3.13}$$

Figure 3.2: Causal structure reflecting the influence that the Captain's signal has
on Bob and Dave's firing, and the influence that their firing has on the Traitor's
health.

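In addition, $T$ is now a deterministic function of $B$, $D$, and $T$'s response-function
variable $r_t$:

$$t = f_t(b, d, r_t) = h_{t,r_t}(b, d) \tag{3.14}$$

where

$$\begin{aligned}
h_{t,0}(b,d) &= t_0 \\
h_{t,1}(b,d) &= \begin{cases} t_0 & \text{if } (b,d) \ne (b_1,d_1) \\ t_1 & \text{if } (b,d) = (b_1,d_1) \end{cases} \\
h_{t,2}(b,d) &= \begin{cases} t_0 & \text{if } (b,d) \ne (b_0,d_1) \\ t_1 & \text{if } (b,d) = (b_0,d_1) \end{cases} \\
h_{t,3}(b,d) &= \begin{cases} t_0 & \text{if } d = d_0 \\ t_1 & \text{if } d = d_1 \end{cases} \\
h_{t,4}(b,d) &= \begin{cases} t_0 & \text{if } (b,d) \ne (b_1,d_0) \\ t_1 & \text{if } (b,d) = (b_1,d_0) \end{cases} \\
h_{t,5}(b,d) &= \begin{cases} t_0 & \text{if } b = b_0 \\ t_1 & \text{if } b = b_1 \end{cases} \\
h_{t,6}(b,d) &= \begin{cases} t_0 & \text{if } (b,d) \in \{(b_0,d_0),(b_1,d_1)\} \\ t_1 & \text{if } (b,d) \in \{(b_1,d_0),(b_0,d_1)\} \end{cases} \\
h_{t,7}(b,d) &= \begin{cases} t_0 & \text{if } (b,d) = (b_0,d_0) \\ t_1 & \text{if } (b,d) \ne (b_0,d_0) \end{cases} \\
h_{t,8}(b,d) &= \begin{cases} t_0 & \text{if } (b,d) \ne (b_0,d_0) \\ t_1 & \text{if } (b,d) = (b_0,d_0) \end{cases} \\
h_{t,9}(b,d) &= \begin{cases} t_0 & \text{if } (b,d) \in \{(b_1,d_0),(b_0,d_1)\} \\ t_1 & \text{if } (b,d) \in \{(b_0,d_0),(b_1,d_1)\} \end{cases} \\
h_{t,10}(b,d) &= \begin{cases} t_0 & \text{if } b = b_1 \\ t_1 & \text{if } b = b_0 \end{cases} \\
h_{t,11}(b,d) &= \begin{cases} t_0 & \text{if } (b,d) = (b_1,d_0) \\ t_1 & \text{if } (b,d) \ne (b_1,d_0) \end{cases} \\
h_{t,12}(b,d) &= \begin{cases} t_0 & \text{if } d = d_1 \\ t_1 & \text{if } d = d_0 \end{cases} \\
h_{t,13}(b,d) &= \begin{cases} t_0 & \text{if } (b,d) = (b_0,d_1) \\ t_1 & \text{if } (b,d) \ne (b_0,d_1) \end{cases} \\
h_{t,14}(b,d) &= \begin{cases} t_0 & \text{if } (b,d) = (b_1,d_1) \\ t_1 & \text{if } (b,d) \ne (b_1,d_1) \end{cases} \\
h_{t,15}(b,d) &= t_1
\end{aligned}$$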


Suppose that we observe the Captain give the signal to shoot ($c_1$), Bob fires
his rifle ($b_1$), and the Traitor survives ($t_1$). If the Captain had not given the signal
to shoot, what is the probability that the Traitor would have died ($t_0$), i.e., what
is $P(t_0^* \mid c_0^*, c_1, b_1, t_1)$?


The instantiated graphical structure for evaluating this conditional probability
is shown in Figure 3.3.


Figure 3.3: To evaluate the counterfactual probability $P(t_0^* \mid c_0^*, c_1, b_1, t_1)$, the
combined functional model (factual/counterfactual worlds) is instantiated with
observations $c_1, b_1, t_1$ and intervention $c_0^*$ (links pointing to $C^*$ are severed).


According to the procedure described in Section 3.2, we can write the
probability of the counterfactual consequent in terms of the response-function
distributions $P(r_c)$, $P(r_b)$, $P(r_d)$, $P(r_t)$:

$$\begin{aligned}
P(t_0^* \mid c_0^*, c_1, b_1, t_1) = \frac{1}{P(b_1, t_1 \mid c_1)} \Bigg\{
& P(r_b = 1) \Big[ P(r_d = 0) \sum_{k \in \{4,5,6,7\}} P(r_t = k) + P(r_d = 1) \sum_{k \in \{1,3,5,7\}} P(r_t = k) \\
& \qquad\quad + P(r_d = 2) \sum_{k \in \{4,5,12,13\}} P(r_t = k) + P(r_d = 3) \sum_{k \in \{1,5,9,13\}} P(r_t = k) \Big] \\
& + P(r_b = 3) \Big[ P(r_d = 1) \sum_{k \in \{1,3,9,11\}} P(r_t = k) + P(r_d = 2) \sum_{k \in \{4,6,12,14\}} P(r_t = k) \Big] \Bigg\}
\end{aligned} \tag{3.15}$$

with the following constraints over the response functions' distributions:

$$\begin{aligned}
P(r_b = 0) + P(r_b = 1) &= P(b_0 \mid c_0) \\
P(r_b = 2) + P(r_b = 3) &= P(b_1 \mid c_0) \\
P(r_b = 0) + P(r_b = 2) &= P(b_0 \mid c_1) \\
P(r_b = 1) + P(r_b = 3) &= P(b_1 \mid c_1) \\
P(r_d = 0) + P(r_d = 1) &= P(d_0 \mid c_0) \\
P(r_d = 2) + P(r_d = 3) &= P(d_1 \mid c_0) \\
P(r_d = 0) + P(r_d = 2) &= P(d_0 \mid c_1) \\
P(r_d = 1) + P(r_d = 3) &= P(d_1 \mid c_1) \\
\sum_{i \in \{0,1,2,3,4,5,6,7\}} P(r_t = i) &= P(t_0 \mid b_0, d_0) \\
\sum_{i \in \{8,9,10,11,12,13,14,15\}} P(r_t = i) &= P(t_1 \mid b_0, d_0) \\
\sum_{i \in \{0,1,4,5,8,9,12,13\}} P(r_t = i) &= P(t_0 \mid b_0, d_1) \\
\sum_{i \in \{2,3,6,7,10,11,14,15\}} P(r_t = i) &= P(t_1 \mid b_0, d_1) \\
\sum_{i \in \{0,1,2,3,8,9,10,11\}} P(r_t = i) &= P(t_0 \mid b_1, d_0) \\
\sum_{i \in \{4,5,6,7,12,13,14,15\}} P(r_t = i) &= P(t_1 \mid b_1, d_0) \\
\sum_{i \in \{0,2,4,6,8,10,12,14\}} P(r_t = i) &= P(t_0 \mid b_1, d_1) \\
\sum_{i \in \{1,3,5,7,9,11,13,15\}} P(r_t = i) &= P(t_1 \mid b_1, d_1)
\end{aligned}$$

Eq. (3.15) is a polynomial expression of the response-function variables, and
is therefore not directly amenable to the linear-optimization procedure detailed
in Appendix B. As an alternative, one could apply techniques for optimizing
nonlinear functions in a convex polytope [Sca85]; however, there is no
guarantee that the global optima will be found by these procedures, so care must be
taken in interpreting the results. In Chapter 4, we will find that global optima
are guaranteed when computing counterfactual beliefs in an order-of-magnitude
probability calculus.
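
The index sets appearing in Eq. (3.15) can be recovered mechanically by enumerating the response-function configurations that agree with the observations and the counterfactual consequent. The sketch below is illustrative only. It assumes a bit encoding inferred from the constraints listed above: for r_b and r_d the high bit gives the child's value under c0 and the low bit its value under c1, and for r_t the bits weighted 8, 4, 2, 1 give T's value at (b0,d0), (b1,d0), (b0,d1), (b1,d1) respectively. The helper names are hypothetical.

```python
from itertools import product


def h_one_parent(r, parent):
    """Binary child with one binary parent: r in 0..3, returns 0 or 1."""
    return (r >> (1 - parent)) & 1


def h_t(r, b, d):
    """Response function for T: r in 0..15, returns 0 (t0) or 1 (t1)."""
    return (r >> (3 - b - 2 * d)) & 1


def consistent_index_sets():
    """Map each (r_b, r_d) to the list of r_t values consistent with
    observing c1, b1, t1 and obtaining t0 when C* is forced to c0."""
    sets = {}
    for r_b, r_d, r_t in product(range(4), range(4), range(16)):
        b, d = h_one_parent(r_b, 1), h_one_parent(r_d, 1)      # factual: C = c1
        b_s, d_s = h_one_parent(r_b, 0), h_one_parent(r_d, 0)  # counterfactual: C* = c0
        t, t_s = h_t(r_t, b, d), h_t(r_t, b_s, d_s)
        if b == 1 and t == 1 and t_s == 0:
            sets.setdefault((r_b, r_d), []).append(r_t)
    return sets


index_sets = consistent_index_sets()
for key in sorted(index_sets):
    print(key, index_sets[key])
```

Each surviving key (r_b, r_d) corresponds to one product term of the objective, which makes the polynomial (rather than linear) form of the expression explicit.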

There are some cases, however, where a polynomial expression is amenable
to linear optimization, because the expression may be manipulated into a form
where linear sub-expressions may be optimized independently. Once these
sub-expressions are optimized, then their optimal values may be substituted into the
original expression, and the procedure is repeated until we are left with a linear
expression that is directly optimizable.

Theorem 3.4.1 Several linear expressions $f_1(x), f_2(x), \ldots, f_n(x)$ may be
independently optimized if

$$\sum_k \operatorname{opt} f_k(x) = \operatorname{opt} \Big[ \sum_k f_k(x) \Big]$$

For example, suppose that we have a model for our domain containing three
binary variables $A$, $B$, and $C$, with the structure $A \rightarrow B \rightarrow C$ and the conditional
probability distributions $P(a)$, $P(b \mid a)$, and $P(c \mid b)$. We make the observation
$\{a_0, c_0\}$ and then wish to know $P(c_1^* \mid a_1^*, a_0, c_0)$, i.e., the probability that $C$ would
have been $c_1$, if $A$ were $a_1$. Figure 3.4 shows the structure of the functional model
corresponding to the probabilistic specification.


Figure 3.4: Bayesian model for evaluating counterfactual queries when the causal
structure is given by $A \rightarrow B \rightarrow C$.


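$C$ is a deterministic function of $B$ and $r_c$, and $B$ is a deterministic function
of $A$ and $r_b$ in the complete model. In terms of this model's response-function
distributions, we may express the counterfactual probability:

$$P(c_1^* \mid a_1^*, a_0, c_0) = \frac{P(r_b = 1) P(r_c = 1) + P(r_b = 2) P(r_c = 2)}{P(c_0 \mid a_0)} \tag{3.16}$$

This expression is nonlinear with respect to the response-function distributions
$P(r_b)$ and $P(r_c)$; however, applying Theorem 3.4.1 shows that $P(r_b = 1)$ and
$P(r_b = 2)$ may be optimized independently.

$$\max \left\{ \begin{array}{c} 0 \\ P(b_1 \mid a_1) - P(b_1 \mid a_0) \end{array} \right\} \le P(r_b = 1) \le \min \left\{ \begin{array}{c} P(b_0 \mid a_0) \\ P(b_1 \mid a_1) \end{array} \right\} \tag{3.17}$$

$$\max \left\{ \begin{array}{c} 0 \\ P(b_1 \mid a_0) - P(b_1 \mid a_1) \end{array} \right\} \le P(r_b = 2) \le \min \left\{ \begin{array}{c} P(b_0 \mid a_1) \\ P(b_1 \mid a_0) \end{array} \right\} \tag{3.18}$$

$$\max \left\{ \begin{array}{c} P(b_1 \mid a_1) - P(b_1 \mid a_0) \\ P(b_1 \mid a_0) - P(b_1 \mid a_1) \end{array} \right\} \le P(r_b = 1) + P(r_b = 2) \le \min \left\{ \begin{array}{c} P(b_0 \mid a_0) + P(b_0 \mid a_1) \\ P(b_1 \mid a_0) + P(b_1 \mid a_1) \end{array} \right\} \tag{3.19}$$

Summing the right hand side of Eqs. (3.17) and (3.18):

$$\min \left\{ \begin{array}{c} P(b_0 \mid a_0) \\ P(b_1 \mid a_1) \end{array} \right\} + \min \left\{ \begin{array}{c} P(b_0 \mid a_1) \\ P(b_1 \mid a_0) \end{array} \right\} = \min \left\{ \begin{array}{c} P(b_0 \mid a_0) + P(b_0 \mid a_1) \\ P(b_1 \mid a_0) + P(b_1 \mid a_1) \\ P(b_0 \mid a_0) + P(b_1 \mid a_0) = 1 \\ P(b_0 \mid a_1) + P(b_1 \mid a_1) = 1 \end{array} \right\} \tag{3.20}$$

But one of the first two terms in the right hand side of Eq. (3.20) must be greater
than or equal to one, while the other term is less than or equal to one; therefore,
the expression reduces to the right hand side of Eq. (3.19). Similar arguments
lead to the conclusion that the sum of the left hand sides in Eqs. (3.17) and (3.18)
is equal to the left hand side of Eq. (3.19). The conditions in Theorem 3.4.1 are
satisfied, allowing us to use linear optimization to compute the bounds on the
counterfactual probability: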


$$\frac{1}{P(c_0 \mid a_0)} \max \left\{ \begin{array}{c} 0 \\ (P(b_1 \mid a_1) - P(b_1 \mid a_0))(P(c_1 \mid b_1) - P(c_1 \mid b_0)) \\ (P(b_1 \mid a_0) - P(b_1 \mid a_1))(P(c_1 \mid b_0) - P(c_1 \mid b_1)) \end{array} \right\}$$

$$\le P(c_1^* \mid a_1^*, a_0, c_0) \le$$

$$\frac{1}{P(c_0 \mid a_0)} \min \left\{ \begin{array}{c} P(b_0 \mid a_0) P(c_0 \mid b_0) + P(b_0 \mid a_1) P(c_0 \mid b_1) \\ P(b_1 \mid a_1) P(c_0 \mid b_0) + P(b_1 \mid a_0) P(c_0 \mid b_1) \\ P(b_0 \mid a_0) P(c_1 \mid b_1) + P(b_0 \mid a_1) P(c_1 \mid b_0) \\ P(b_1 \mid a_1) P(c_1 \mid b_1) + P(b_1 \mid a_0) P(c_1 \mid b_0) \end{array} \right\} \tag{3.21}$$

This section has shown that although many interesting counterfactual
probabilities are polynomial with respect to the underlying response-function
distribution, and hence susceptible to the problem arising from local optima within the
parameter space, there are some cases where linear optimization is still possible
because some of the terms in the expression may be independently optimized, and
then combined to form a closed-form expression for the bounds. This technique
will be applied in Chapter 5 when deriving bounds on treatment effects given a
subject's category of treatment consumption.


3.5  Model  marginalization 


In the last section, we computed the bounds for a counterfactual probability
based on a model containing three variables $A$, $B$, and $C$, where $B$'s value did
not take part in the specification of the counterfactual query. One might consider
marginalizing $B$ out of the model, because $B$ is not referenced in our
observations or the counterfactual conditional. Although this is admissible when the
prior probabilities on the response-function variables $P(r_b)$ and $P(r_c)$ are known
(allowing exact calculation of the counterfactual probability), this strategy is
fallible when these distributions are unspecified and hence only bounds on the
counterfactual probability may be computed.

Figure 3.4 shows the structure of the functional model corresponding to the
probabilistic specification. If we marginalize out the variable $B$, the structure of
the functional model is given by Figure 3.5 (note that the response-function
variable for $C$ in the partial model is denoted by $s_c$). The mapping from the complete
model's conditional probability specifications to the partial model's specification
is given simply by


$$P(c_1 \mid a_0) = P(c_1 \mid b_0) P(b_0 \mid a_0) + P(c_1 \mid b_1) P(b_1 \mid a_0) \tag{3.22}$$

$$P(c_1 \mid a_1) = P(c_1 \mid b_0) P(b_0 \mid a_1) + P(c_1 \mid b_1) P(b_1 \mid a_1) \tag{3.23}$$


In the complete model, $C$ is a deterministic function of $B$ and $r_c$, and $B$
is a deterministic function of $A$ and $r_b$. However, in the partial model, $C$ is a
deterministic function of $A$ and $s_c$. In terms of the partial model's
response-function distributions, we may express the counterfactual probability:

$$P(c_1^* \mid a_1^*, a_0, c_0) = \frac{P(s_c = 1)}{P(c_0 \mid a_0)} \tag{3.24}$$


Figure 3.5: Partial model over $A$ and $C$.


Given an instantiation of $B$ and $C$'s response-function distributions, the two
counterfactual probabilities given by Eqs. (3.16) and (3.24) will always be equal,
because the numerator of Eq. (3.24) is just the result of marginalizing out $B$.

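However, in terms of the partial model, the bounds on the counterfactual
probability are derived as:

$$\frac{1}{P(c_0 \mid a_0)} \max \left\{ \begin{array}{c} 0 \\ P(c_1 \mid a_1) - P(c_1 \mid a_0) \end{array} \right\} \le P(c_1^* \mid a_1^*, a_0, c_0) \le \frac{1}{P(c_0 \mid a_0)} \min \left\{ \begin{array}{c} P(c_0 \mid a_0) \\ P(c_1 \mid a_1) \end{array} \right\}$$

which may be expanded as follows using Eqs. (3.22) and (3.23)

$$\frac{1}{P(c_0 \mid a_0)} \max \left\{ \begin{array}{c} 0 \\ \big[ P(b_0 \mid a_1) P(c_1 \mid b_0) + P(b_1 \mid a_1) P(c_1 \mid b_1) \big] - \big[ P(b_0 \mid a_0) P(c_1 \mid b_0) + P(b_1 \mid a_0) P(c_1 \mid b_1) \big] \end{array} \right\}$$

$$\le P(c_1^* \mid a_1^*, a_0, c_0) \le$$

$$\frac{1}{P(c_0 \mid a_0)} \min \left\{ \begin{array}{c} P(b_0 \mid a_0) P(c_0 \mid b_0) + P(b_1 \mid a_0) P(c_0 \mid b_1) \\ P(b_0 \mid a_1) P(c_1 \mid b_0) + P(b_1 \mid a_1) P(c_1 \mid b_1) \end{array} \right\}$$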


Comparing these bounds to those computed with the full model, Eq. (3.21), one
can see that the numeric bounds evaluated from the partial model are never
tighter and almost always looser than those evaluated from the complete model
analysis.

In this section, we have demonstrated that unreferenced variables may not be
marginalized out of the probabilistic causal model without potentially affecting
the evaluation of bounds for a counterfactual probability. However, the bounds
evaluated from the marginalized model still hold true; they are just not
necessarily tight. Therefore, one may still consider evaluating bounds under the
marginalized model if one cannot guarantee that global optima have been found
from the analysis using the complete model.
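
The comparison between the two analyses can be made concrete numerically. The following sketch evaluates the closed-form bounds of the complete model, Eq. (3.21), and those of the partial model for one hypothetical parameter setting; the numbers are illustrative and do not come from the report. The lower bound of the complete model is written as max{0, (P(b1|a1) - P(b1|a0))(P(c1|b1) - P(c1|b0))}, which is an assumption about the reconstructed form of Eq. (3.21). In the code, p0 and p1 stand for P(b1|a0) and P(b1|a1), and q0 and q1 for P(c1|b0) and P(c1|b1).

```python
from fractions import Fraction as F


def full_model_bounds(p0, p1, q0, q1):
    """Bounds on P(c1*|a1*,a0,c0) from the complete model A -> B -> C."""
    norm = (1 - p0) * (1 - q0) + p0 * (1 - q1)          # P(c0|a0)
    lower = max(0, (p1 - p0) * (q1 - q0)) / norm
    upper = min(
        (1 - p0) * (1 - q0) + (1 - p1) * (1 - q1),
        p1 * (1 - q0) + p0 * (1 - q1),
        (1 - p0) * q1 + (1 - p1) * q0,
        p1 * q1 + p0 * q0,
    ) / norm
    return lower, upper


def partial_model_bounds(p0, p1, q0, q1):
    """Bounds on the same query after marginalizing B out (model A -> C)."""
    c1_a0 = q0 * (1 - p0) + q1 * p0                     # Eq. (3.22)
    c1_a1 = q0 * (1 - p1) + q1 * p1                     # Eq. (3.23)
    c0_a0 = 1 - c1_a0
    lower = max(0, c1_a1 - c1_a0) / c0_a0
    upper = min(c0_a0, c1_a1) / c0_a0
    return lower, upper


params = (F(1, 5), F(7, 10), F(3, 10), F(4, 5))         # hypothetical values
full = full_model_bounds(*params)
partial = partial_model_bounds(*params)
print("complete model:", full)
print("partial model: ", partial)
```

For this setting the interval from the partial model contains the interval from the complete model, in line with the observation that marginalization can only loosen the bounds.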

3.6  Conclusion 

This chapter has developed a procedure for evaluating bounds on counterfactual
probabilities. The cornerstone of counterfactual analysis is the use of functional
models with response-function variables, for which the counterfactual probability
may be uniquely written. The task of determining bounds involves the
optimization of this expression under the constraints imposed by the known probabilistic
specification. In general, the task is reduced to the optimization of a polynomial
function subject to linear constraints, which introduces the problem of local
minima/maxima. However, if the counterfactual probability is linear with respect to
the functional specification, then the bounds are easily found via linear
programming. In addition, in some cases we may be able to derive closed-form bounds
on counterfactual probabilities in terms of the probabilistic specification.

CHAPTER 4


Evaluating counterfactuals from κ rankings:
Computation and Bounds

4.1  Introduction 

In Chapters 2 and 3 a formalism was developed for evaluating and bounding
counterfactual probabilities given a causal structure of the relevant domain along
with conditional probabilities of each variable given its set of causal influences.
Detractors of reasoning with probabilistic causal networks claim that it is
unreasonable to assume that we can obtain the numbers which parameterize the
causal model, and that we may only elicit crude measures of belief from human
reasoners. An alternative representation of these belief measures is given by an
order-of-magnitude abstraction of probabilities, known as κ-rankings [Spo88].

In general, the objective function to optimize for evaluating bounds on
counterfactual probabilities will be a polynomial function with respect to the
unspecified prior probabilities on the response-function variables. Therefore, algorithms
for optimizing cannot always verify that the global minimum or maximum has been
discovered, because the algorithms may terminate at local minima/maxima. If we
cannot guarantee global optima, then the returned bounds on the counterfactual
probability may be too tight, and therefore are not valid bounds. If, however, we represent
knowledge by a κ ranking over the worlds, then we can always evaluate the upper
and lower bounds on our belief in a counterfactual consequent. Of course, this
only gives us an approximation to the bounds that would be determined using a
fully specified probabilistic causal model.

In this chapter, we will reformulate the evaluation of bounds on counterfactual
beliefs in terms of κ rankings over possible worlds. The next section will give
background on reasoning with κ ranking functions. In Section 4.4, a general
description of a procedure for evaluating bounds on counterfactual beliefs given κ
ranking functions will be given. Section 4.5 will demonstrate this on an example,
and finally Section 4.6 will give some concluding remarks.

4.2 κ rankings


κ rankings ([Spo88]) provide an order-of-magnitude abstraction of probability
distributions that states that if $P(a)$ is of order $O(\epsilon^n)$ for some constant $\epsilon$ less
than 1 and non-negative integer $n$, then the κ ranking of $a$ is $\kappa(a) = n$. Note that
as a probability decreases, its κ ranking will increase. This transformation from
probabilities to κ rankings partitions the range of probabilities into equivalence
classes designated by a non-negative integer that indicates how surprising a
particular event would be ($\kappa(a) = 0$ indicates that event $a$ would not be surprising).

One of the obvious benefits of κ rankings is greater ease in specifying beliefs:
rather than specifying a precise probability, only a crude estimate of the
probability is necessary. Of course, this also means that the result is
less precise; if you do not have precise probabilities to begin with, the solution
should not be expected to be precise.

The basic operators of ranking functions correspond very nicely with the
operators in probability theory: multiplication and addition in probability theory
are replaced by addition and minimization, respectively, in κ calculus. While
probability theory has the following axioms

$$P(a) = \sum_{w \models a} P(w) \tag{4.1}$$

$$P(a) + P(\neg a) = 1 \tag{4.2}$$

$$P(a, b) = P(b \mid a) P(a) \tag{4.3}$$

κ calculus has the corresponding set of axioms

$$\kappa(a) = \min_{w \models a} \kappa(w) \tag{4.4}$$

$$\min\{\kappa(a), \kappa(\neg a)\} = 0 \tag{4.5}$$

$$\kappa(a, b) = \kappa(b \mid a) + \kappa(a) \tag{4.6}$$

We will now use this relationship between κ and probability calculi to describe
the procedure for evaluating counterfactual κ rankings.
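
The κ axioms translate directly into a few lines of code. The sketch below is an illustration under assumed data: the ranking over the four worlds (c, b) is hypothetical, and the function names are not taken from the report. It evaluates Eq. (4.4) by minimization and obtains a conditional rank by rearranging Eq. (4.6).

```python
import math


def kappa_event(kappa_world, event):
    """Eq. (4.4): the rank of an event is the minimum rank of its worlds."""
    return min((k for w, k in kappa_world.items() if event(w)), default=math.inf)


def kappa_conditional(kappa_world, b, a):
    """Eq. (4.6) rearranged: kappa(b|a) = kappa(a, b) - kappa(a)."""
    joint = kappa_event(kappa_world, lambda w: a(w) and b(w))
    return joint - kappa_event(kappa_world, a)


# Hypothetical ranking over worlds (c, b); the least surprising world has rank 0.
worlds = {(0, 0): 0, (0, 1): 2, (1, 0): 1, (1, 1): 1}

is_c1 = lambda w: w[0] == 1
is_b1 = lambda w: w[1] == 1
is_b0 = lambda w: w[1] == 0

k_c1 = kappa_event(worlds, is_c1)
k_b1_given_c1 = kappa_conditional(worlds, is_b1, is_c1)
normalized = min(kappa_event(worlds, is_b1), kappa_event(worlds, is_b0))  # Eq. (4.5)
print(k_c1, k_b1_given_c1, normalized)
```

Summation over worlds in Eq. (4.1) becomes a `min`, and the product in Eq. (4.3) becomes an integer sum, which is why κ computations stay within simple integer arithmetic.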


4.3 κ-ranked counterfactuals

Consider the causal structure $C \rightarrow B$ with an associated conditional kappa
ranking $\kappa(b \mid c)$. Suppose that we have observed $(c_0, b_0)$. What is our belief that $B$
would have been equal to $b_1$, if $C$ were $c_1$? According to the formalism for
evaluating counterfactual probabilities in Chapter 2, we generate a functional model for
the given causal structure. In this case we introduce a response-function variable
$r_b$ which specifies the mapping from $C$ to $B$ as follows:

$$b = f_b(c, r_b) = h_{b,r_b}(c)$$

where

$$\begin{aligned}
h_{b,0}(c) &= b_0 \\
h_{b,1}(c) &= \begin{cases} b_0 & \text{if } c = c_0 \\ b_1 & \text{if } c = c_1 \end{cases} \\
h_{b,2}(c) &= \begin{cases} b_1 & \text{if } c = c_0 \\ b_0 & \text{if } c = c_1 \end{cases} \\
h_{b,3}(c) &= b_1
\end{aligned}$$

The kappa ranking over these response functions then parameterizes the
model.

Similar to the strategy developed in Chapter 3, we can write the counterfactual kappa
ranking for our query as follows:

$$\begin{aligned}
\kappa(b_1^* \mid c_1^*, c_0, b_0) &= \kappa(c_0, b_0, b_1^*) - \kappa(c_0, b_0) \\
&= \kappa(r_{b1}) - \kappa(b_0 \mid c_0)
\end{aligned}$$

If the kappa ranking over the response-function variable $r_b$ is known, then a unique
counterfactual kappa rank may be computed; however, if this information is not
available, then the counterfactual kappa ranking may only be bounded under the
constraints given by the known kappa ranking $\kappa(b \mid c)$. These constraints are:

$$\kappa(b_0 \mid c_1) = \min\{\kappa(r_{b0}), \kappa(r_{b2})\} \tag{4.7}$$

$$\kappa(b_0 \mid c_0) = \min\{\kappa(r_{b0}), \kappa(r_{b1})\} \tag{4.8}$$

$$\kappa(b_1 \mid c_0) = \min\{\kappa(r_{b2}), \kappa(r_{b3})\} \tag{4.9}$$

$$\kappa(b_1 \mid c_1) = \min\{\kappa(r_{b1}), \kappa(r_{b3})\} \tag{4.10}$$

$$\kappa(r_{bi}) \ge 0 \quad \forall i \in \{0, 1, 2, 3\} \tag{4.11}$$

The formulation of the problem was straightforward following the
formalism of Chapter 2; however, an appropriate mechanism needs to be available for
performing this integer optimization with constraints containing minimization
operators.

Eqs. (4.7)-(4.10) immediately imply

$$\kappa(r_{b0}) \ge \max\{\kappa(b_0 \mid c_0);\ \kappa(b_0 \mid c_1)\} \tag{4.12}$$

$$\kappa(r_{b1}) \ge \max\{\kappa(b_0 \mid c_0);\ \kappa(b_1 \mid c_1)\} \tag{4.13}$$

$$\kappa(r_{b2}) \ge \max\{\kappa(b_1 \mid c_0);\ \kappa(b_0 \mid c_1)\} \tag{4.14}$$

$$\kappa(r_{b3}) \ge \max\{\kappa(b_1 \mid c_0);\ \kappa(b_1 \mid c_1)\} \tag{4.15}$$


No dependencies exist among these expressions that prevent equality from
holding in all these constraints; therefore, finding the minimum for individual $\kappa(r_b)$
terms is trivial. For our counterfactual query, this leads to a lower bound on the
counterfactual κ-ranking:

$$\kappa(b_1^* \mid c_1^*, c_0, b_0) \ge \max \left\{ \begin{array}{c} 0 \\ \kappa(b_1 \mid c_1) - \kappa(b_0 \mid c_0) \end{array} \right\} \tag{4.16}$$

When maximizing $\kappa(r_{b1})$, there are only two situations to consider: either
$\kappa(r_{b1})$ is completely unbounded from above; or the upper bound is equal to the
lower bound in Eq. (4.13), i.e.,

$$\kappa(r_{b1}) = \max\{\kappa(b_0 \mid c_0);\ \kappa(b_1 \mid c_1)\} \tag{4.17}$$

To determine which situation holds, we first remove $\kappa(r_{b1})$ from the minimization
sets of Eqs. (4.7)-(4.10), and check for satisfiability. If satisfied, then $\kappa(r_{b1})$ is
not bounded from above; otherwise, the bounds reduce to equality

$$\kappa(b_1^* \mid c_1^*, c_0, b_0) = \max \left\{ \begin{array}{c} 0 \\ \kappa(b_1 \mid c_1) - \kappa(b_0 \mid c_0) \end{array} \right\}$$


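Of course, if the counterfactual κ is equal to zero, then we would like to know
what the counterfactual κ is for the negation of the counterfactual consequent.
For our example, we would be interested in $\kappa(b_0^* \mid c_1^*, c_0, b_0)$. Applying the same
procedure as before we obtain the lower bound

$$\kappa(b_0^* \mid c_1^*, c_0, b_0) \ge \max \left\{ \begin{array}{c} 0 \\ \kappa(b_0 \mid c_1) - \kappa(b_0 \mid c_0) \end{array} \right\}$$

If $\kappa(r_{b0}) = \infty$ is not satisfiable, then the bounds reduce to equality

$$\kappa(b_0^* \mid c_1^*, c_0, b_0) = \max \left\{ \begin{array}{c} 0 \\ \kappa(b_0 \mid c_1) - \kappa(b_0 \mid c_0) \end{array} \right\}$$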
4.4 General case


4.4.1 Functional expression


In Section 3.2, we gave a declarative definition for counterfactual probabilities
written in terms of the structure $\{\mathrm{pa}(x_i)\}$ and the parameters of the
response-function distributions:

$$P(c^* \mid a^*, o) = \frac{\sum_{r \in R} P(r)}{P(o)}$$

where

$$R = \{\, r \mid \forall x_i \in o \; (x_i = g_{x_i}(r)) \text{ and } \forall x_j^* \in c^* \; (x_j^* = g^*_{x_j}(r)) \,\}$$

This definition may be transformed according to Eqs. (4.1)-(4.6) into an
expression written in terms of the κ ranking functions of the response-function
variables:

$$\kappa(c^* \mid a^*, o) = \min_{r \in R} \kappa(r) - \kappa(o) \tag{4.18}$$

where the form of $\kappa(r)$ is always given by the sum of κ's for each independent set
of response-function variables in $r$.

4.4.2  Constraints 

The κ rankings over the model's observable variables $\kappa(x_i \mid \mathrm{pa}(x_i))$ impose a set
of constraints on the κ ranking over the response-function variables $\kappa(r_{x_i})$ of the
form

$$\kappa(x_i \mid \mathrm{pa}(x_i)) = \min\{\kappa(r_{x_i}) : x_i = f_{x_i}(\mathrm{pa}(x_i), r_{x_i})\} \tag{4.19}$$

Similar to the treatment of exogenous common causes discussed in Section 3.3, if
$X_i$ and $X_j$ are assumed to have an exogenous common cause, then the common
constraint for these two variables will be given instead by

$$\kappa(x_i, x_j \mid \mathrm{pa}(x_i) - \{x_j\}, \mathrm{pa}(x_j) - \{x_i\}) = \min\{\kappa(r_{x_i}, r_{x_j}) : x_i = f_{x_i}(\mathrm{pa}(x_i), r_{x_i}) \text{ and } x_j = f_{x_j}(\mathrm{pa}(x_j), r_{x_j})\} \tag{4.20}$$

Therefore, in general, we will be optimizing a function with minimization
operators over constraints also containing minimization operators.

4.4.3 Optimization


In this section we will show that optimization of the objective κ function (Eq. (4.18))
under the constraints of Eqs. (4.19) and (4.20) is trivial for the minimum value,
but requires either a complete enumeration or a search procedure for determining
the upper bound on the κ-ranking.

4.4.3.1 Minimization

The constraints given by Eq. (4.19) immediately imply the following lower bound
on the κ ranking of individual response functions:

$$\kappa(r_{x_i} = j) \ge \max\{\kappa(x_i \mid \mathrm{pa}(x_i)) : x_i = f_{x_i}(\mathrm{pa}(x_i), r_{x_i} = j)\} \tag{4.21}$$

These lower bounds are obtained simply by substituting the known conditional
κ rankings $\kappa(x_i \mid \mathrm{pa}(x_i))$ into the right hand side of Eq. (4.21). Given that the
objective function consists only of minimization and summation operators, the
strict lower bound on the κ of the counterfactual can always be evaluated by
substituting in the lower bounds for each $\kappa(r_{x_i})$.

4.4.3.2 Maximization

In maximizing the objective function, it helps to note that if a response-function
rank $\kappa(r_{x_i} = j)$ is not forced to be equal to its lower bound given by Eq. (4.21),
then that κ term is completely unbounded from above, and may be assumed to
be infinite. Therefore, when we try to maximize the objective function, we will
set each response-function κ to either its lower bound value or infinite (which is
equivalent to removing every instance of that κ term from the objective function).

This suggests a crude algorithm for evaluating the upper bound on the
objective function: simply evaluate the objective function for every configuration
of response-function κ's consistent with the constraints imposed by the known
conditional κ rankings, and take the maximum. Besides the enumeration of every
configuration, we must also have a means for checking the consistency of each
configuration. Although computationally expensive, if all configurations can be
enumerated, we can guarantee that the strict upper bound may be computed.

There are ways in which we may decrease the computational cost by performing
some preprocessing to eliminate a majority of configurations from the
search space. In addition, the search space can be sorted to speed up the task. A
formal algorithm will not be presented, but an example demonstrating the main
concepts of maximizing a counterfactual $\kappa$ will be discussed in the next section.


4.5  Example 

In Section 3.4, we attempted to evaluate a counterfactual query related to the
firing-squad example. In order to evaluate bounds on the counterfactual probability,
a polynomial objective function over a set of linear constraints was to
be optimized. Unfortunately, methods for optimizing polynomial functions are
plagued by the presence of local minima/maxima in the parameter space. However,
if the belief specification is given in terms of $\kappa$ rankings, the bounds on the
counterfactual $\kappa$ ranking can be determined precisely.

In order to make the $\kappa$ bounds on the counterfactual conditional more interesting,
the story behind the causal structure of Figure 3.3 will be changed.
Suppose there are four individuals, Carol, Bob, Dave, and Tina, with a known
pattern of party attendance. The variables C, B, D, and T indicate whether each
individual attended the party, respectively; the values $c_1$, $b_1$, $d_1$, and $t_1$ indicate
that the individuals were at the party, while $c_0$, $b_0$, $d_0$, and $t_0$ indicate that the
individuals were not at the party.

Bob really dislikes parties, so he almost never attends them, but if Carol is there
he is slightly more likely to be there than if Carol is not at the party. This can
be modelled in $\kappa$ rankings as follows:

$$\kappa(b_0 \mid c_0) = 0 \qquad \kappa(b_0 \mid c_1) = 0$$

$$\kappa(b_1 \mid c_0) = 2 \qquad \kappa(b_1 \mid c_1) = 1$$

Dave, though, loves parties, so he almost always attends them. However, if
Carol is there he is a little less likely to be there than if Carol is not there. This
can be modelled as follows:

$$\kappa(d_0 \mid c_0) = 3 \qquad \kappa(d_0 \mid c_1) = 1$$

$$\kappa(d_1 \mid c_0) = 0 \qquad \kappa(d_1 \mid c_1) = 0$$

Tina is a friend of Bob and Dave and is not very excited about going to
parties. She also knows that Bob and Dave get into scuffles when they get
together; therefore, Tina typically will not go to parties if both Bob and Dave
are going to be there. The $\kappa$ ranking representing this information is given by

$$\kappa(t_0 \mid b_0, d_0) = 0 \qquad \kappa(t_0 \mid b_1, d_0) = 4$$

$$\kappa(t_1 \mid b_0, d_0) = 2 \qquad \kappa(t_1 \mid b_1, d_0) = 0$$

$$\kappa(t_0 \mid b_0, d_1) = 3 \qquad \kappa(t_0 \mid b_1, d_1) = 0$$

$$\kappa(t_1 \mid b_0, d_1) = 0 \qquad \kappa(t_1 \mid b_1, d_1) = 1$$


Each  variable  is  a  deterministic  function  of  its  observable  causal  influences  and 
its  response-function  variable  according  to  Eqs.  (3.2),  (2.3),  (3.9),  and  (3.14). 

Suppose that we observe Carol at the party ($c_1$), Bob at the party ($b_1$),
and Tina at the party ($t_1$). If Carol were not at the party ($c_0^*$), how surprised
would we be to see Tina absent from the party ($t_0^*$)? In other words, what is

$$\kappa(t_0^* \mid c_0^*, c_1, b_1, t_1)?$$

The instantiated graphical structure for evaluating this $\kappa$ ranking is the same
as that depicted in Figure 3.3.

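According to the procedure described in Section 4.4.1, we can write the $\kappa$ rank
of the counterfactual consequent in terms of the response-function $\kappa$ rankings
$\kappa(r_c)$, $\kappa(r_b)$, $\kappa(r_d)$, $\kappa(r_t)$:

$$
\kappa(t_0^* \mid c_0^*, c_1, b_1, t_1) =
\min\left\{
\begin{array}{l}
\kappa(r_b{=}1) + \min\left\{
\begin{array}{l}
\kappa(r_d{=}0) + \min\{\kappa(r_t{=}j) \mid j \in \{4,5,6,7\}\} \\
\kappa(r_d{=}1) + \min\{\kappa(r_t{=}j) \mid j \in \{1,3,5,7\}\} \\
\kappa(r_d{=}2) + \min\{\kappa(r_t{=}j) \mid j \in \{4,5,12,13\}\} \\
\kappa(r_d{=}3) + \min\{\kappa(r_t{=}j) \mid j \in \{1,5,9,13\}\}
\end{array}
\right\} \\[2ex]
\kappa(r_b{=}3) + \min\left\{
\begin{array}{l}
\kappa(r_d{=}1) + \min\{\kappa(r_t{=}j) \mid j \in \{1,3,9,11\}\} \\
\kappa(r_d{=}2) + \min\{\kappa(r_t{=}j) \mid j \in \{4,6,12,14\}\}
\end{array}
\right\}
\end{array}
\right\}
- \kappa(b_1, t_1 \mid c_1)
\tag{4.22}
$$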


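with the following constraints over the response-functions' $\kappa$ rankings:

$$
\begin{aligned}
\min\{\kappa(r_b{=}0), \kappa(r_b{=}1)\} &= \kappa(b_0 \mid c_0) \\
\min\{\kappa(r_b{=}2), \kappa(r_b{=}3)\} &= \kappa(b_1 \mid c_0) \\
\min\{\kappa(r_b{=}0), \kappa(r_b{=}2)\} &= \kappa(b_0 \mid c_1) \\
\min\{\kappa(r_b{=}1), \kappa(r_b{=}3)\} &= \kappa(b_1 \mid c_1) \\
\min\{\kappa(r_d{=}0), \kappa(r_d{=}1)\} &= \kappa(d_0 \mid c_0) \\
\min\{\kappa(r_d{=}2), \kappa(r_d{=}3)\} &= \kappa(d_1 \mid c_0) \\
\min\{\kappa(r_d{=}0), \kappa(r_d{=}2)\} &= \kappa(d_0 \mid c_1) \\
\min\{\kappa(r_d{=}1), \kappa(r_d{=}3)\} &= \kappa(d_1 \mid c_1) \\
\min\{\kappa(r_t{=}i) \mid i \in \{0,1,2,3,4,5,6,7\}\} &= \kappa(t_0 \mid b_0, d_0) \\
\min\{\kappa(r_t{=}i) \mid i \in \{8,9,10,11,12,13,14,15\}\} &= \kappa(t_1 \mid b_0, d_0) \\
\min\{\kappa(r_t{=}i) \mid i \in \{0,1,4,5,8,9,12,13\}\} &= \kappa(t_0 \mid b_0, d_1) \\
\min\{\kappa(r_t{=}i) \mid i \in \{2,3,6,7,10,11,14,15\}\} &= \kappa(t_1 \mid b_0, d_1) \\
\min\{\kappa(r_t{=}i) \mid i \in \{0,1,2,3,8,9,10,11\}\} &= \kappa(t_0 \mid b_1, d_0) \\
\min\{\kappa(r_t{=}i) \mid i \in \{4,5,6,7,12,13,14,15\}\} &= \kappa(t_1 \mid b_1, d_0) \\
\min\{\kappa(r_t{=}i) \mid i \in \{0,2,4,6,8,10,12,14\}\} &= \kappa(t_0 \mid b_1, d_1) \\
\min\{\kappa(r_t{=}i) \mid i \in \{1,3,5,7,9,11,13,15\}\} &= \kappa(t_1 \mid b_1, d_1)
\end{aligned}
$$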

We can simplify these constraints by the following procedure. For each $\kappa(r_x{=}i)$, find
the constraints containing $\kappa(r_x{=}i)$ that have the maximum right-hand side. Then,
in all other constraints over $\kappa(x \mid \mathrm{pa}(x))$, eliminate $\kappa(r_x{=}i)$ from the constraint's
minimization set. Applying this procedure reduces the above constraints to:

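$$
\begin{aligned}
\kappa(r_b{=}0) &= 0 \\
\min\{\kappa(r_b{=}2), \kappa(r_b{=}3)\} &= 2 \\
\kappa(r_b{=}1) &= 1 \\
\min\{\kappa(r_d{=}0), \kappa(r_d{=}1)\} &= 3 \\
\kappa(r_d{=}3) &= 0 \\
\kappa(r_d{=}2) &= 1 \\
\kappa(r_t{=}6) &= 0 \\
\min\{\kappa(r_t{=}14), \kappa(r_t{=}15)\} &= 2 \\
\min\{\kappa(r_t{=}4), \kappa(r_t{=}5), \kappa(r_t{=}12), \kappa(r_t{=}13)\} &= 3 \\
\min\{\kappa(r_t{=}i) \mid i \in \{0,1,2,3,8,9,10,11\}\} &= 4 \\
\kappa(r_t{=}7) &= 1
\end{aligned}
\tag{4.23}
$$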
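The reduction procedure can be sketched in a few lines of Python. This is only an illustration under assumed conventions: each constraint is represented as a pair (set of response-function indices, right-hand side), and the names `simplify`, `BOB`, and `DAVE` are hypothetical.

```python
def simplify(constraints):
    """Reduce constraints of the form min{kappa(r) : r in members} = rhs.

    Each index r is kept only in the constraints whose right-hand side equals
    the largest right-hand side among all constraints containing r.
    Duplicate reduced constraints are merged.
    """
    top = {}
    for members, rhs in constraints:
        for r in members:
            top[r] = max(top.get(r, rhs), rhs)
    reduced = {}
    for members, rhs in constraints:
        kept = frozenset(r for r in members if top[r] == rhs)
        if kept:  # an empty set would signal inconsistent input rankings
            reduced[kept] = rhs
    return sorted((sorted(kept), rhs) for kept, rhs in reduced.items())


# Constraints for Bob and Dave, with the right-hand sides of the example.
BOB = [({0, 1}, 0), ({2, 3}, 2), ({0, 2}, 0), ({1, 3}, 1)]
DAVE = [({0, 1}, 3), ({2, 3}, 0), ({0, 2}, 1), ({1, 3}, 0)]

print(simplify(BOB))   # [([0], 0), ([1], 1), ([2, 3], 2)]
print(simplify(DAVE))  # [([0, 1], 3), ([2], 1), ([3], 0)]
```

The printed results agree with the first six lines of the reduced system: singleton sets become equalities, and the remaining sets stay as minimization constraints.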

The conditional kappa term on the right-hand side of Eq. (4.22) may be computed
by substituting the conditional kappa rankings specified at the beginning
of this section into the following equation:

$$
\kappa(b_1, t_1 \mid c_1) = \kappa(b_1 \mid c_1) + \min\left\{
\begin{array}{l}
\kappa(d_0 \mid c_1) + \kappa(t_1 \mid b_1, d_0) \\
\kappa(d_1 \mid c_1) + \kappa(t_1 \mid b_1, d_1)
\end{array}
\right\} = 2
$$


Figure 4.1 represents the structure of Eq. (4.22) and will be used to represent
the search state (not the search tree) for finding the upper bound on the $\kappa$ ranking
of the counterfactual. Earlier, we mentioned that each response-function $\kappa$ value
is either constrained to be equal to its lower bound or completely unconstrained.
Therefore, each edge in the tree is either assigned its minimum value or set to
$\infty$. The $\kappa$ ranking of the counterfactual for these values of the response-function
$\kappa$'s is given by the minimum sum of $\kappa$ terms over all paths from the root to any
leaf node.

To start the search procedure, we assign every response-function $\kappa$ its
minimum value as given by the right-hand side of Eq. (4.21). This assignment
will never violate the constraints imposed by the conditional $\kappa$ rankings on the
observable variables.


Figure  4.1:  Initial  representation  of  maximization  search  state. 

We then evaluate the $\kappa$ sums along each directed path. Taking the minimum
of these sums gives us the lower bound on the $\kappa$ ranking of the counterfactual, and
this also gives us a starting point for the upper bound. The key to the algorithm
is that the upper bound can exceed the current candidate value only if
response-function $\kappa$ values can be set to infinity along every directed path whose
$\kappa$ sum does not exceed that value.

Consider, then, our current example. Figure 4.2 shows all directed paths with
the minimal $\kappa$ sum, equal to 1. In order to extend the bound we must be able to
sever this path by setting one of the response-function $\kappa$'s to infinity. Those
edges along the path that we immediately know cannot be set to infinity because
of the simplified constraints are represented with dashed lines. $\kappa(r_b{=}3)$, though,
may be set to infinity without violating any constraints. Therefore, we know that
the upper bound on the counterfactual's $\kappa$ ranking is greater than 1.

[Figure annotation: $-\kappa(t_1, b_1 \mid c_1) = -2$]

Figure 4.2: Representation of maximization search state after severing all kappa
1 paths.

The next step is to consider all directed paths whose $\kappa$ sums are less than
or equal to the next potential upper bound, in this case 2. These paths are
shown in Figure 4.3. For this particular example, $\kappa(r_d{=}0)$, $\kappa(r_t{=}5)$, and $\kappa(r_t{=}13)$ must also
be set to $\infty$ to sever all paths with $\kappa$ sums less than or equal to 2. This set of
$\kappa$'s may be set to infinity without violating the given constraints; therefore, the
counterfactual $\kappa$ ranking must be greater than 2.

[Figure annotation: $-\kappa(t_1, b_1 \mid c_1) = -2$]

Figure 4.3: Representation of maximization search state after severing all kappa
2 paths.

Again we consider all directed paths whose $\kappa$ sums are less than or equal to
the next potential upper bound, now 3. These paths are shown in Figure 4.4.
In order to sever the left-most three directed paths, both $\kappa(r_d{=}0)$ and $\kappa(r_d{=}1)$
must be set to infinity. However, this violates the constraint
$\min\{\kappa(r_d{=}0), \kappa(r_d{=}1)\} = 3$ of Eq. (4.23).
Therefore, it is impossible to sever all directed paths with $\kappa$ sums less than or
equal to 3, leading us to the conclusion that the upper bound on $\kappa(t_0^* \mid c_0^*, c_1, b_1, t_1)$
is 3. Combined with the earlier results for the lower bound, the range of the
counterfactual's $\kappa$ ranking is given by

$$1 \le \kappa(t_0^* \mid c_0^*, c_1, b_1, t_1) \le 3$$
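The range just obtained can be cross-checked with the crude enumeration described in Section 4.4.3.2. The sketch below is not the search procedure of the figures; it is a brute-force check under assumed conventions: response functions are encoded as bit patterns consistent with the index sets of Eq. (4.22), every $\kappa$ is either at its lower bound or infinite, and all function and variable names are hypothetical.

```python
from itertools import product

INF = float("inf")

# Conditional kappa tables from the example, indexed TABLE[parents][value].
K_B = {0: (0, 2), 1: (0, 1)}                       # kappa(b | c)
K_D = {0: (3, 0), 1: (1, 0)}                       # kappa(d | c)
K_T = {(0, 0): (0, 2), (0, 1): (3, 0),
       (1, 0): (4, 0), (1, 1): (0, 1)}             # kappa(t | b, d)


def f1(r, c):
    """One-parent response function r in 0..3: high bit is the output for c0."""
    return (r >> (1 - c)) & 1


def ft(r, pa):
    """Tina's response function r in 0..15; bit weights 8, 4, 2, 1 hold the
    outputs for (b0,d0), (b1,d0), (b0,d1), (b1,d1) respectively."""
    b, d = pa
    return (r >> (3 - (b + 2 * d))) & 1


def lower_bounds(table, fn, n):
    """Eq. (4.21): largest conditional kappa each response function must explain."""
    return [max(table[pa][fn(r, pa)] for pa in table) for r in range(n)]


def feasible(table, fn, lb):
    """Yield assignments (each kappa at its lower bound or infinite) for which
    min{kappa(r) : fn(r, pa) = x} still equals kappa(x | pa) for every pa, x."""
    n = len(lb)
    tight = [[r for r in range(n) if fn(r, pa) == x and lb[r] == table[pa][x]]
             for pa in table for x in (0, 1)]
    for mask in range(1 << n):                     # bit r set => kappa(r) infinite
        if all(any(not (mask >> r) & 1 for r in group) for group in tight):
            yield [INF if (mask >> r) & 1 else lb[r] for r in range(n)]


# Response-function triples compatible with the evidence (c1, b1, t1) and with
# the counterfactual outcome t0 once C is set to c0.
PATHS = [(rb, rd, rt) for rb, rd, rt in product(range(4), range(4), range(16))
         if f1(rb, 1) == 1
         and ft(rt, (1, f1(rd, 1))) == 1
         and ft(rt, (f1(rb, 0), f1(rd, 0))) == 0]


def kappa_bounds():
    lb_b = lower_bounds(K_B, f1, 4)
    lb_d = lower_bounds(K_D, f1, 4)
    lb_t = lower_bounds(K_T, ft, 16)
    # kappa(b1, t1 | c1), the term subtracted in Eq. (4.22)
    k_evidence = K_B[1][1] + min(K_D[1][d] + K_T[1, d][1] for d in (0, 1))

    def objective(kb, kd, kt):
        return min(kb[rb] + kd[rd] + kt[rt] for rb, rd, rt in PATHS) - k_evidence

    lower = objective(lb_b, lb_d, lb_t)
    d_cfgs = list(feasible(K_D, f1, lb_d))
    t_cfgs = list(feasible(K_T, ft, lb_t))
    upper = max(objective(kb, kd, kt)
                for kb in feasible(K_B, f1, lb_b) for kd in d_cfgs for kt in t_cfgs)
    return lower, upper


print(kappa_bounds())  # (1, 3)
```

The enumeration visits roughly a hundred thousand consistent configurations, which is affordable here but illustrates why the path-severing search is preferable for larger models.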


4.6  Conclusion 

In this chapter we have presented a method for evaluating bounds on beliefs
in counterfactuals when our general knowledge is given by order-of-magnitude
abstractions of probability distributions. Where evaluating bounds on counterfactual
probabilities may not succeed because of the presence of local optima in
the response-function parameter space, we can always guarantee that the upper
and lower bounds of the counterfactual's $\kappa$ ranking may be found given sufficient
time. The lower bound on the $\kappa$ ranking ("we would be at least as surprised as")
can be evaluated almost directly once we have the counterfactual's $\kappa$ rank written
in terms of the response-functions' $\kappa$ rankings. For the upper bound ("we would
be at most as surprised as"), an informal algorithm was presented through an
example.

[Figure annotation: $-\kappa(t_1, b_1 \mid c_1) = -2$]

Figure 4.4: Representation of maximization search state showing that all kappa 3
paths may not be simultaneously severed.
Part III


Applications

CHAPTER 5


Clinical trials with imperfect compliance

5.1  Introduction 

Consider  an  experimental  study  where  random  assignment  has  taken  place  but 
compliance  is  not  perfect  (i.e.,  the  treatment  received  differs  from  that  assigned). 
It  is  well  known  that  under  such  conditions  a  bias  may  be  introduced,  in  the 
sense  that  the  true  causal  effect  of  the  treatment  may  deviate  substantially  from 
the  causal  effect  computed  by  simply  comparing  subjects  receiving  the  treatment 
with  those  not  receiving  the  treatment.  Because  the  subjects  who  did  not  comply 
with  the  assignment  may  be  precisely  those  who  would  have  responded  adversely 
(positively)  to  the  treatment,  the  actual  effect  of  the  treatment,  when  applied 
uniformly  to  the  population,  might  be  substantially  less  (more)  effective  than  the 
study  reveals. 

In an attempt to avert this bias, economists have devised correctional formulas
based on an "instrumental variables" model ([BT84]) which, in general, do
not hold outside the linear regression model. A recent analysis by [EF91] departs
from the linear regression model, but still makes restrictive commitments
to a particular mode of interaction between compliance and response. [Rob89]
and [Man90] derived nonparametric bounds on treatment effects using different
techniques; however, their bounds are not tight. [Hol88] has given a general
formulation of the problem (which he called "encouragement design") in terms
of Rubin's model of causal effect and has outlined its relation to path analysis
and structural equations models. [AIR93], also invoking Rubin's model, have
identified a set of assumptions under which the "Instrumental Variable" formula
is valid for certain subpopulations. These subpopulations cannot be identified
from empirical observation alone, and the need remains to devise alternative,
assumption-free formulas for assessing the effect of treatment over the population
as a whole. In this chapter, we derive bounds on the average treatment effect
that rely solely on observed quantities and are universal, that is, valid no matter
what model actually governs the interactions between compliance and response.




The  canonical  partial-compliance  setting  can  be  graphically  modeled  as  shown 
in  Figure  5.1. 


[Figure 5.1 node labels recovered from the original: Latent Factors, Observed Response.]

Figure 5.1: Graphical representation of causal dependencies in a randomized
clinical trial with partial compliance.


We assume that Z, D, and Y are observed binary variables, where Z represents
the (randomized) treatment assignment, D is the treatment actually received,
and Y is the observed response. U represents all factors, both observed and
unobserved, that may influence the outcome Y and the treatment D. To facilitate
the notation, we let z, d, and y represent, respectively, the values taken by the
variables Z, D, and Y, with the following interpretation: $z \in \{z_0, z_1\}$, where $z_1$ asserts
that treatment has been assigned ($z_0$, its negation); $d \in \{d_0, d_1\}$, where $d_1$ asserts
that treatment has been administered ($d_0$, its negation); and $y \in \{y_0, y_1\}$, where $y_1$
asserts a positive observed response ($y_0$, its negation). The domain of U remains
unspecified and may, in general, combine the spaces of several random variables,
both discrete and continuous.

The  graphical  model  reflects  two  assumptions  of  independence: 

1. The treatment assignment does not influence Y directly, but only through
the actual treatment D, that is,

$$Z \perp Y \mid \{D, U\} \tag{5.1}$$

In practice, any direct effect Z might have on Y would be adjusted for
through the use of a placebo.

2. Z and U are marginally independent, that is, $Z \perp U$. This independence is
partly ensured through the randomization of Z, which rules out a common
cause for both Z and U. The absence of a direct path from Z to U represents
the assumption that a person's disposition to comply with or deviate from a
given assignment is not in itself affected by the assignment; any such effect
can be viewed as part of the disposition.

These assumptions impose on the joint distribution¹ the decomposition

$$P(y, d, z, u) = P(y \mid d, u)\, P(d \mid z, u)\, P(z)\, P(u) \tag{5.2}$$

which, of course, cannot be observed directly because U is a latent variable.
However, the marginal distribution $P(y, d, z)$ and, in particular, the conditional
distributions $P(y, d \mid z)$, $z \in \{z_0, z_1\}$, are observed, and the challenge is to assess
the causal effect of D on Y from these distributions.²

In addition to the independence assumptions above, the causal model of Figure 5.1
reflects claims about the behavior of the population under external interventions.
In particular, it reflects the assumption that $P(y \mid d, u)$ is a stable
quantity: the probability that an individual with characteristics U = u given
treatment D = d will respond with Y = y remains the same, regardless of how
the treatment was selected, be it by choice or by policy. Therefore, if we wish to
predict the distribution of Y under a condition where the treatment D is applied
uniformly to the population, we should calculate

$$P(y^* \mid d^*) = E_u[P(y \mid d, u)] \tag{5.3}$$

$$\phantom{P(y^* \mid d^*)} = \sum_u P(y \mid d, u)\, P(u) \tag{5.4}$$

Likewise, if we are interested in estimating the average change in Y due to
treatment, we define the average causal effect, $\mathrm{ACE}(D \to Y)$ ([Hol88]), as

$$\mathrm{ACE}(D \to Y) = E_u[P(y_1 \mid d_1, u) - P(y_1 \mid d_0, u)] \tag{5.5}$$

$$\phantom{\mathrm{ACE}(D \to Y)} = P(y_1^* \mid d_1^*) - P(y_1^* \mid d_0^*) \tag{5.6}$$

For uniformity of notation, we can define, in an analogous way, the average causal
effects of the assignment Z, $\mathrm{ACE}(Z \to Y)$ and $\mathrm{ACE}(Z \to D)$. However, since Z
is assigned at random, these two quantities can be obtained from the observed
distribution:

$$\mathrm{ACE}(Z \to D) = P(d_1 \mid z_1) - P(d_1 \mid z_0) \tag{5.7}$$

$$\mathrm{ACE}(Z \to Y) = P(y_1 \mid z_1) - P(y_1 \mid z_0) \tag{5.8}$$
¹ We take the liberty of denoting the prior distribution of U by $P(u)$, even though U may
consist of continuous variables.

² In practice, of course, only a finite sample of $P(y, d \mid z)$ will be observed, but since our task
is one of identification, not estimation, we make the large-sample assumption and consider
$P(y, d \mid z)$ as given.
The task of causal inference is then to estimate or bound the expression in
Eq. (5.6), given the observed probabilities $P(y, d \mid z_0)$ and $P(y, d \mid z_1)$.

[Pea93b, Rob89, Man90] have derived bounds on the two terms on the right-hand
side of Eq. (5.6) given the distribution over Y, D, and Z:

$$\max[P(y_1, d_1 \mid z_1);\ P(y_1, d_1 \mid z_0)] \;\le\; E[P(y_1 \mid d_1, u)] \;\le\; 1 - \max[P(y_0, d_1 \mid z_0);\ P(y_0, d_1 \mid z_1)] \tag{5.9}$$

$$\max[P(y_1, d_0 \mid z_0);\ P(y_1, d_0 \mid z_1)] \;\le\; E[P(y_1 \mid d_0, u)] \;\le\; 1 - \max[P(y_0, d_0 \mid z_0);\ P(y_0, d_0 \mid z_1)] \tag{5.10}$$

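Choosing appropriate terms to bound the difference

$$E[P(y_1 \mid d_1, u)] - E[P(y_1 \mid d_0, u)]$$

we obtain lower and upper bounds on the causal effect of D on Y:

$$P(y_1, d_1 \mid z_1) + P(y_0, d_0 \mid z_0) - 1 \;\le\; \mathrm{ACE}(D \to Y) \;\le\; 1 - P(y_0, d_1 \mid z_1) - P(y_1, d_0 \mid z_0) \tag{5.11}$$

or, alternatively,

$$
\begin{aligned}
\mathrm{ACE}(D \to Y) &\ge \mathrm{ACE}(Z \to Y) - P(y_1, d_0 \mid z_1) - P(y_0, d_1 \mid z_0) \\
\mathrm{ACE}(D \to Y) &\le \mathrm{ACE}(Z \to Y) + P(y_0, d_0 \mid z_1) + P(y_1, d_1 \mid z_0)
\end{aligned}
\tag{5.12}
$$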

Due to its simplicity and wide range of applicability, we will call the bounds of
Eq. (5.12) the natural bounds (three other less intuitive expressions for the upper
and lower bounds may be inferred from Eqs. (5.9) and (5.10), but these will
not be presented here because they will be derived in Section 5.2). The natural
bounds guarantee that the causal effect of the actual treatment cannot exceed
that of the intent-to-treat by more than the sum of two measurable quantities,
$P(y_0, d_0 \mid z_1) + P(y_1, d_1 \mid z_0)$; they also guarantee that the causal effect of treatment
cannot drop below that of the intent-to-treat by more than the sum of two
other measurable quantities, $P(y_1, d_0 \mid z_1) + P(y_0, d_1 \mid z_0)$. The width of the natural
bound, not surprisingly, is given by the rate of defection, $P(d_1 \mid z_0) + P(d_0 \mid z_1)$.
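A short numerical sketch may help confirm that Eqs. (5.11) and (5.12) describe the same interval and that its width equals the rate of defection. The distribution `P_EXAMPLE` below is hypothetical (it is not data from any study), and the dictionary layout `p[(y, d, z)] = P(y, d | z)` and function names are assumptions made for illustration.

```python
# Hypothetical observed distribution, p[(y, d, z)] = P(y, d | z).
P_EXAMPLE = {
    (0, 0, 0): 0.5, (0, 1, 0): 0.1, (1, 0, 0): 0.3, (1, 1, 0): 0.1,
    (0, 0, 1): 0.2, (0, 1, 1): 0.2, (1, 0, 1): 0.1, (1, 1, 1): 0.5,
}


def natural_bounds(p):
    """Natural bounds on ACE(D -> Y) in the form of Eq. (5.11)."""
    lower = p[1, 1, 1] + p[0, 0, 0] - 1
    upper = 1 - p[0, 1, 1] - p[1, 0, 0]
    return lower, upper


def natural_bounds_itt(p):
    """The same bounds written relative to ACE(Z -> Y), as in Eq. (5.12)."""
    ace_zy = (p[1, 0, 1] + p[1, 1, 1]) - (p[1, 0, 0] + p[1, 1, 0])
    lower = ace_zy - p[1, 0, 1] - p[0, 1, 0]
    upper = ace_zy + p[0, 0, 1] + p[1, 1, 0]
    return lower, upper


def defection_rate(p):
    """P(d1 | z0) + P(d0 | z1)."""
    return p[0, 1, 0] + p[1, 1, 0] + p[0, 0, 1] + p[1, 0, 1]


print(natural_bounds(P_EXAMPLE))
print(natural_bounds_itt(P_EXAMPLE))
print(defection_rate(P_EXAMPLE))
```

For this hypothetical distribution both forms give an interval from about 0 to about 0.5, and the defection rate is also about 0.5 (up to floating-point rounding).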
Before continuing to the more refined derivation of bounds on $\mathrm{ACE}(D \to Y)$,
we should point out that the structural model of Figure 5.1 imposes definite constraints
on the observed distributions $P(y, d \mid z_0)$ and $P(y, d \mid z_1)$. The constraints,
obtained directly from Eq. (5.2), are

$$
\begin{aligned}
P(y_0, d_1 \mid z_0) + P(y_1, d_1 \mid z_1) &\le 1 \\
P(y_0, d_1 \mid z_1) + P(y_1, d_1 \mid z_0) &\le 1 \\
P(y_0, d_0 \mid z_0) + P(y_1, d_0 \mid z_1) &\le 1 \\
P(y_0, d_0 \mid z_1) + P(y_1, d_0 \mid z_0) &\le 1
\end{aligned}
\tag{5.13}
$$

These constraints constitute necessary and sufficient conditions for a marginal
probability distribution $P(y, d, z)$ to be generated by the structure of Figure 5.1
(proof in Appendix A.1), and therefore they may serve as an operational test for
the compatibility of that structure with the observed data.
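The operational test of Eq. (5.13) amounts to four comparisons. The sketch below assumes the observed distribution is stored as `p[(y, d, z)] = P(y, d | z)`; the function name, the tolerance parameter, and both example distributions are hypothetical.

```python
def satisfies_constraints(p, tol=1e-12):
    """Check the four inequalities of Eq. (5.13).

    For each treatment value d and assignment z, the constraint reads
    P(y0, d | z) + P(y1, d | z') <= 1, where z' is the other assignment.
    """
    return all(p[0, d, z] + p[1, d, 1 - z] <= 1 + tol
               for d in (0, 1) for z in (0, 1))


# A hypothetical distribution that passes the test.
COMPATIBLE = {
    (0, 0, 0): 0.5, (0, 1, 0): 0.1, (1, 0, 0): 0.3, (1, 1, 0): 0.1,
    (0, 0, 1): 0.2, (0, 1, 1): 0.2, (1, 0, 1): 0.1, (1, 1, 1): 0.5,
}

# A hypothetical distribution with P(y0, d1 | z0) + P(y1, d1 | z1) = 1.2 > 1.
INCOMPATIBLE = {
    (0, 0, 0): 0.1, (0, 1, 0): 0.7, (1, 0, 0): 0.1, (1, 1, 0): 0.1,
    (0, 0, 1): 0.2, (0, 1, 1): 0.2, (1, 0, 1): 0.1, (1, 1, 1): 0.5,
}

print(satisfies_constraints(COMPATIBLE))    # True
print(satisfies_constraints(INCOMPATIBLE))  # False
```

With finite samples the inequalities would be checked on estimated frequencies, so a small violation need not rule out the structure; that statistical question is outside the scope of this sketch.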

5.2  Tight  bounds  on  average  causal  effect  of  treatment 

Strict bounds on the causal effect of treatment received on subject response may
be derived by following the procedure detailed in Section 3.3, where the objective
function to be optimized is the difference between the two counterfactual
probabilities on the right-hand side of Eq. (5.6).

5.2.1  Response-function  model 

First, the functional model corresponding to the probabilistic model of Figure 5.1
must be specified. For each of the observable variables in the model (Z, D,
and Y), we define the corresponding response-function variables ($r_z$, $r_d$, and $r_y$,
respectively).

Figure 5.2 shows the graphical representation of the resulting functional model.
Because D and Y are assumed to be influenced by an unobservable common cause,
the response-function variables $r_d$ and $r_y$ are connected by an edge.

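The states of the variables $r_d$ and $r_y$ have the following interpretations:

D is a deterministic function of the variable Z and $r_d \in \{0, 1, 2, 3\}$:

$$d = f_d(z, r_d) = h_{d,r_d}(z) \tag{5.14}$$

where

$$
h_{d,0}(z) = d_0 \qquad
h_{d,1}(z) = \begin{cases} d_0 & \text{if } z = z_0 \\ d_1 & \text{if } z = z_1 \end{cases} \qquad
h_{d,2}(z) = \begin{cases} d_1 & \text{if } z = z_0 \\ d_0 & \text{if } z = z_1 \end{cases} \qquad
h_{d,3}(z) = d_1
$$

Figure 5.2: A structure equivalent to that of Figure 5.1 but employing
response-function variables $r_z$, $r_d$, and $r_y$.

Similarly, Y is a deterministic function of D and $r_y \in \{0, 1, 2, 3\}$:

$$y = f_y(d, r_y) = h_{y,r_y}(d) \tag{5.15}$$

where

$$
h_{y,0}(d) = y_0 \qquad
h_{y,1}(d) = \begin{cases} y_0 & \text{if } d = d_0 \\ y_1 & \text{if } d = d_1 \end{cases} \qquad
h_{y,2}(d) = \begin{cases} y_1 & \text{if } d = d_0 \\ y_0 & \text{if } d = d_1 \end{cases} \qquad
h_{y,3}(d) = y_1
$$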


The correspondence between the states of variables $r_d$ and $r_y$ and the potential
response vectors in Rubin's model [RR83] is rather transparent: each state
corresponds to a counterfactual statement specifying how a unit in the population
(e.g., a person) would have reacted to any given input. For example, $r_d = 1$
represents units with perfect compliance, while $r_d = 2$ represents units with
perfect defiance. Similarly, $r_y = 1$ represents units with perfect response to
treatment, while $r_y = 0$ represents units with no response ($y = y_0$) regardless
of treatment. The counterfactual variables $Y_1$ and $Y_0$ usually invoked in Rubin's
model can be obtained from $r_y$ as follows:

$$
Y_1 = (Y \mid D = d_1) = \begin{cases} y_1 & \text{if } r_y = 1 \text{ or } r_y = 3 \\ y_0 & \text{otherwise} \end{cases}
$$

$$
Y_0 = (Y \mid D = d_0) = \begin{cases} y_1 & \text{if } r_y = 2 \text{ or } r_y = 3 \\ y_0 & \text{otherwise} \end{cases}
$$


In general, treatment response and compliance attitudes may not be independent,
hence the arrow $r_d \to r_y$ in Figure 5.2. The joint distribution over
$r_d \times r_y$ requires 15 independent parameters, and these parameters are sufficient
for specifying the model of Figure 5.2,

$$P(y, d, z, r_d, r_y) = P(y \mid d, r_y)\, P(d \mid r_d, z)\, P(z)\, P(r_d, r_y)$$

because Y and D stand in functional relation to their parents in the graph. The
causal effect of the treatment can now be obtained directly from Eqs. (5.4) and
(5.15) according to Eq. (3.1), giving

$$P(y_1^* \mid d_1^*) = P(r_y{=}1) + P(r_y{=}3) \tag{5.16}$$

$$P(y_1^* \mid d_0^*) = P(r_y{=}2) + P(r_y{=}3) \tag{5.17}$$

and

$$\mathrm{ACE}(D \to Y) = P(r_y{=}1) - P(r_y{=}2) \tag{5.18}$$

5.2.2  Linear  programming  formulation 

In this section we will explicate the relationship between the parameters of the
observed distribution $P(y, d \mid z)$ and the parameters of the joint distribution $P(r_d, r_y)$
of the potential-response functions. This will lead directly to the linear constraints
needed for minimizing/maximizing $\mathrm{ACE}(D \to Y)$ given the observation
$P(y, d \mid z)$.

The conditional distribution $P(y, d \mid z)$ over the observable variables is fully
specified by eight parameters, which will be notated as follows:

$$
\begin{aligned}
p_{00.0} &= P(y_0, d_0 \mid z_0) & \qquad p_{00.1} &= P(y_0, d_0 \mid z_1) \\
p_{01.0} &= P(y_0, d_1 \mid z_0) & \qquad p_{01.1} &= P(y_0, d_1 \mid z_1) \\
p_{10.0} &= P(y_1, d_0 \mid z_0) & \qquad p_{10.1} &= P(y_1, d_0 \mid z_1) \\
p_{11.0} &= P(y_1, d_1 \mid z_0) & \qquad p_{11.1} &= P(y_1, d_1 \mid z_1)
\end{aligned}
$$

The probabilistic constraints

$$\sum_{n=00}^{11} p_{n.0} = 1 \tag{5.19}$$

$$\sum_{n=00}^{11} p_{n.1} = 1 \tag{5.20}$$

further imply that $p = (p_{00.0}, p_{01.0}, p_{10.0}, p_{11.0}, p_{00.1}, p_{01.1}, p_{10.1}, p_{11.1})$ can be
specified by a point in six-dimensional space. This space will be referred to as P.
Eqs. (5.7) and (5.8) may be rewritten in terms of these parameters as

$$\mathrm{ACE}(Z \to D) = p_{11.1} + p_{01.1} - p_{11.0} - p_{01.0} \tag{5.21}$$

$$\mathrm{ACE}(Z \to Y) = p_{11.1} + p_{10.1} - p_{11.0} - p_{10.0} \tag{5.22}$$

The joint probability over $r_d \times r_y$, $P(r_d, r_y)$, has 16 parameters and completely
specifies the population under study. These parameters will be notated as

$$q_{jk} = P(r_d = j, r_y = k)$$

where $j, k \in \{0, 1, 2, 3\}$. The probabilistic constraint

$$\sum_{j=0}^{3} \sum_{k=0}^{3} q_{jk} = 1$$

implies that $q = (q_{00}, q_{01}, q_{02}, q_{03}, q_{10}, q_{11}, q_{12}, q_{13}, q_{20}, q_{21}, q_{22}, q_{23}, q_{30}, q_{31}, q_{32}, q_{33})$
specifies a point in 15-dimensional space. This space will be referred to as Q.

Eq. (5.18) can now be rewritten as a linear combination of the Q parameters:

$$\mathrm{ACE}(D \to Y) = q_{01} + q_{11} + q_{21} + q_{31} - q_{02} - q_{12} - q_{22} - q_{32} \tag{5.23}$$
Given some point q in Q space, there is a direct linear transformation to the
corresponding point p in the observation space P:

$$
\begin{aligned}
p_{00.0} &= q_{00} + q_{01} + q_{10} + q_{11} \\
p_{01.0} &= q_{20} + q_{22} + q_{30} + q_{32} \\
p_{10.0} &= q_{02} + q_{03} + q_{12} + q_{13} \\
p_{11.0} &= q_{21} + q_{23} + q_{31} + q_{33}
\end{aligned}
\tag{5.24}
$$

$$
\begin{aligned}
p_{00.1} &= q_{00} + q_{01} + q_{20} + q_{21} \\
p_{01.1} &= q_{10} + q_{12} + q_{30} + q_{32} \\
p_{10.1} &= q_{02} + q_{03} + q_{22} + q_{23} \\
p_{11.1} &= q_{11} + q_{13} + q_{31} + q_{33}
\end{aligned}
\tag{5.25}
$$

which will sometimes be written in matrix form, $p = Pq$.
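The linear map from q to p follows mechanically from the response functions of Eqs. (5.14) and (5.15). The sketch below derives p by pushing each response-function pair through both assignments; the layout `q[rd][ry]`, the dictionary `p[(y, d, z)]`, and the function names are assumptions made for illustration.

```python
def h(r, x):
    """Response function r in {0, 1, 2, 3} applied to a binary input x:
    0 always returns 0, 1 copies the input, 2 inverts it, 3 always returns 1."""
    return (0, x, 1 - x, 1)[r]


def observed_from_q(q):
    """Map q[rd][ry] = P(r_d = rd, r_y = ry) to p[(y, d, z)] = P(y, d | z)."""
    p = {(y, d, z): 0.0 for y in (0, 1) for d in (0, 1) for z in (0, 1)}
    for rd in range(4):
        for ry in range(4):
            for z in (0, 1):
                d = h(rd, z)          # treatment received, Eq. (5.14)
                y = h(ry, d)          # response, Eq. (5.15)
                p[y, d, z] += q[rd][ry]
    return p


def ace(q):
    """Eq. (5.23): P(r_y = 1) - P(r_y = 2)."""
    return sum(q[rd][1] - q[rd][2] for rd in range(4))


# A population of perfect compliers who all respond to treatment.
q_point = [[0.0] * 4 for _ in range(4)]
q_point[1][1] = 1.0
p_point = observed_from_q(q_point)
print(p_point[0, 0, 0], p_point[1, 1, 1], ace(q_point))  # 1.0 1.0 1.0
```

Reading off which `q[rd][ry]` cells contribute to a given `p[(y, d, z)]` reproduces the corresponding row of Eqs. (5.24) and (5.25); for example, `p[(0, 0, 0)]` collects the cells with `rd` in {0, 1} and `ry` in {0, 1}.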

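Given a point p in P space, the strict lower bound on $\mathrm{ACE}(D \to Y)$ can be
determined by solving the following linear programming problem:

Minimize: $q_{01} + q_{11} + q_{21} + q_{31} - q_{02} - q_{12} - q_{22} - q_{32}$

Subject to:

$$
\begin{aligned}
\sum_{j=0}^{3} \sum_{k=0}^{3} q_{jk} &= 1 \\
Pq &= p \\
q_{jk} &\ge 0 \quad \text{for } j, k \in \{0, 1, 2, 3\}
\end{aligned}
\tag{5.26}
$$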

5.3  Closed-form  solutions  to  the  linear  programming 
problem 

Given an observed point p in P space, $L_{D \to Y}(p)$ and $U_{D \to Y}(p)$, respectively, will
represent the strict lower and upper bounds on $\mathrm{ACE}(D \to Y)$ associated with p.
More precisely,

$$L_{D \to Y}(p) = \min_{q \ \text{s.t.}\ p = Pq} \mathrm{ACE}(D \to Y) \tag{5.27}$$

$$U_{D \to Y}(p) = \max_{q \ \text{s.t.}\ p = Pq} \mathrm{ACE}(D \to Y) \tag{5.28}$$

where Eq. (5.23) gives $\mathrm{ACE}(D \to Y)$ in terms of q.
For every given point p, the optimization above can be executed using the
Simplex Tableau algorithm (see [DM81]), which yields a pair of numerical values
for $L_{D \to Y}(p)$ and $U_{D \to Y}(p)$. Fortunately, the size of the problem permits
a closed-form solution to be obtained by enumerating all vertices of the dual
linear-programming problem's constraint polygon (see Appendix B). This procedure
leads to the following bounds:

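$$
L_{D \to Y}(p) = \max\left\{
\begin{array}{l}
p_{11.1} + p_{00.0} - 1 \\
p_{11.0} + p_{00.1} - 1 \\
-p_{01.1} - p_{10.1} \\
-p_{01.0} - p_{10.0} \\
p_{11.0} - p_{11.1} - p_{10.1} - p_{01.0} - p_{10.0} \\
p_{11.1} - p_{11.0} - p_{10.0} - p_{01.1} - p_{10.1} \\
p_{00.1} - p_{01.1} - p_{10.1} - p_{01.0} - p_{00.0} \\
p_{00.0} - p_{01.0} - p_{10.0} - p_{01.1} - p_{00.1}
\end{array}
\right\}
\tag{5.29}
$$

$$
U_{D \to Y}(p) = \min\left\{
\begin{array}{l}
1 - p_{01.1} - p_{10.0} \\
1 - p_{01.0} - p_{10.1} \\
p_{11.1} + p_{00.1} \\
p_{11.0} + p_{00.0} \\
-p_{01.0} + p_{01.1} + p_{00.1} + p_{11.0} + p_{00.0} \\
-p_{01.1} + p_{11.1} + p_{00.1} + p_{01.0} + p_{00.0} \\
-p_{10.1} + p_{11.1} + p_{00.1} + p_{11.0} + p_{10.0} \\
-p_{10.0} + p_{11.0} + p_{00.0} + p_{11.1} + p_{10.1}
\end{array}
\right\}
\tag{5.30}
$$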

Note that the first terms in these two expressions correspond to the natural bounds
of Eq. (5.11). Tables 5.1 and 5.2 list the regions of P space for which each of
the terms in Eqs. (5.29) and (5.30) represents the lower/upper bound, respectively.
These bounds constitute a substantial improvement over those derived by
Robins (1989) and Manski (1990), which correspond to the four upper terms in
both (5.29) and (5.30). The width of these bounds cannot exceed the rate of
noncompliance, $P(d_1 \mid z_0) + P(d_0 \mid z_1)$.
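The closed-form expressions of Eqs. (5.29) and (5.30) translate directly into code. The sketch below assumes the observed distribution is stored as `p[(y, d, z)] = P(y, d | z)`; the function name and the example distribution are hypothetical and serve only to exercise the formulas.

```python
def ace_bounds(p):
    """Bounds on ACE(D -> Y) from Eqs. (5.29) and (5.30).

    Local names follow the text: p11_1 stands for p_{11.1} = P(y1, d1 | z1).
    """
    p00_0, p01_0, p10_0, p11_0 = p[0, 0, 0], p[0, 1, 0], p[1, 0, 0], p[1, 1, 0]
    p00_1, p01_1, p10_1, p11_1 = p[0, 0, 1], p[0, 1, 1], p[1, 0, 1], p[1, 1, 1]
    lower = max(
        p11_1 + p00_0 - 1,
        p11_0 + p00_1 - 1,
        -p01_1 - p10_1,
        -p01_0 - p10_0,
        p11_0 - p11_1 - p10_1 - p01_0 - p10_0,
        p11_1 - p11_0 - p10_0 - p01_1 - p10_1,
        p00_1 - p01_1 - p10_1 - p01_0 - p00_0,
        p00_0 - p01_0 - p10_0 - p01_1 - p00_1,
    )
    upper = min(
        1 - p01_1 - p10_0,
        1 - p01_0 - p10_1,
        p11_1 + p00_1,
        p11_0 + p00_0,
        -p01_0 + p01_1 + p00_1 + p11_0 + p00_0,
        -p01_1 + p11_1 + p00_1 + p01_0 + p00_0,
        -p10_1 + p11_1 + p00_1 + p11_0 + p10_0,
        -p10_0 + p11_0 + p00_0 + p11_1 + p10_1,
    )
    return lower, upper


# Hypothetical observed distribution (not data from any study).
P_EXAMPLE = {
    (0, 0, 0): 0.5, (0, 1, 0): 0.1, (1, 0, 0): 0.3, (1, 1, 0): 0.1,
    (0, 0, 1): 0.2, (0, 1, 1): 0.2, (1, 0, 1): 0.1, (1, 1, 1): 0.5,
}

print(ace_bounds(P_EXAMPLE))
```

For this hypothetical distribution the active terms are the first ones in each list, so the result coincides with the natural bounds, roughly 0 and 0.5 up to floating-point rounding.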

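| Conditions | $L_{D \to Y}(p)$ |
|---|---|
| $p_{11.1} \ge p_{11.0}$; $p_{01.1} + p_{10.1} \ge p_{01.0}$; $p_{00.0} \ge p_{00.1}$; $p_{01.0} + p_{10.0} \ge p_{10.1}$ | $p_{11.1} + p_{00.0} - 1$ |
| $p_{11.0} \ge p_{11.1}$; $p_{01.0} + p_{10.0} \ge p_{01.1}$; $p_{00.1} \ge p_{00.0}$; $p_{01.1} + p_{10.1} \ge p_{10.0}$ | $p_{11.0} + p_{00.1} - 1$ |
| $p_{11.0} + p_{10.0} \ge p_{11.1} \ge p_{11.0}$; $p_{01.0} + p_{00.0} \ge p_{00.1} \ge p_{00.0}$ | $-p_{01.1} - p_{10.1}$ |
| $p_{11.1} + p_{10.1} \ge p_{11.0} \ge p_{11.1}$; $p_{01.1} + p_{00.1} \ge p_{00.0} \ge p_{00.1}$ | $-p_{01.0} - p_{10.0}$ |
| $p_{11.0} \ge p_{11.1} + p_{10.1}$; $p_{01.1} \ge p_{01.0} + p_{10.0}$ | $p_{11.0} - p_{11.1} - p_{10.1} - p_{01.0} - p_{10.0}$ |
| $p_{11.1} \ge p_{11.0} + p_{10.0}$; $p_{01.0} \ge p_{01.1} + p_{10.1}$ | $p_{11.1} - p_{01.1} - p_{10.1} - p_{11.0} - p_{10.0}$ |
| $p_{10.0} \ge p_{01.1} + p_{10.1}$; $p_{00.1} \ge p_{01.0} + p_{00.0}$ | $p_{00.1} - p_{01.1} - p_{10.1} - p_{01.0} - p_{00.0}$ |
| $p_{10.1} \ge p_{01.0} + p_{10.0}$; $p_{00.0} \ge p_{01.1} + p_{00.1}$ | $p_{00.0} - p_{01.0} - p_{10.0} - p_{01.1} - p_{00.1}$ |

Table 5.1: Lower bounds on $\mathrm{ACE}(D \to Y)$ given a point p in the observation
space P.

| Conditions | $U_{D \to Y}(p)$ |
|---|---|
| $p_{01.1} \ge p_{01.0}$; $p_{11.1} + p_{00.1} \ge p_{11.0}$; $p_{10.0} \ge p_{10.1}$; $p_{11.0} + p_{00.0} \ge p_{00.1}$ | $1 - p_{01.1} - p_{10.0}$ |
| $p_{01.0} \ge p_{01.1}$; $p_{11.0} + p_{00.0} \ge p_{11.1}$; $p_{10.1} \ge p_{10.0}$; $p_{11.1} + p_{00.1} \ge p_{00.0}$ | $1 - p_{01.0} - p_{10.1}$ |
| $p_{01.0} + p_{00.0} \ge p_{01.1} \ge p_{01.0}$; $p_{11.0} + p_{10.0} \ge p_{10.1} \ge p_{10.0}$ | $p_{11.1} + p_{00.1}$ |
| $p_{01.1} + p_{00.1} \ge p_{01.0} \ge p_{01.1}$; $p_{11.1} + p_{10.1} \ge p_{10.0} \ge p_{10.1}$ | $p_{11.0} + p_{00.0}$ |
| $p_{01.0} \ge p_{01.1} + p_{00.1}$; $p_{11.1} \ge p_{11.0} + p_{00.0}$ | $-p_{01.0} + p_{01.1} + p_{00.1} + p_{11.0} + p_{00.0}$ |
| $p_{01.1} \ge p_{01.0} + p_{00.0}$; $p_{11.0} \ge p_{11.1} + p_{00.1}$ | $-p_{01.1} + p_{11.1} + p_{00.1} + p_{01.0} + p_{00.0}$ |
| $p_{00.0} \ge p_{11.1} + p_{00.1}$; $p_{10.1} \ge p_{11.0} + p_{10.0}$ | $-p_{10.1} + p_{11.1} + p_{00.1} + p_{11.0} + p_{10.0}$ |
| $p_{00.1} \ge p_{11.0} + p_{00.0}$; $p_{10.0} \ge p_{11.1} + p_{10.1}$ | $-p_{10.0} + p_{11.0} + p_{00.0} + p_{11.1} + p_{10.1}$ |

Table 5.2: Upper bounds on $\mathrm{ACE}(D \to Y)$ given a point p in observation space
P.

We may also derive bounds on the treatment responses under the condition
that one treatment is uniformly applied to the population, by optimizing
Eqs. (5.16) and (5.17) individually (under the same linear constraints). The
resulting bounds are:

$$
\max\left\{
\begin{array}{l}
p_{10.0} + p_{11.0} - p_{00.1} - p_{11.1} \\
p_{10.1} \\
p_{10.0} \\
p_{01.0} + p_{10.0} - p_{00.1} - p_{01.1}
\end{array}
\right\}
\;\le\; P(y_1^* \mid d_0^*) \;\le\;
\min\left\{
\begin{array}{l}
p_{01.0} + p_{10.0} + p_{10.1} + p_{11.1} \\
1 - p_{00.1} \\
1 - p_{00.0} \\
p_{10.0} + p_{11.0} + p_{01.1} + p_{10.1}
\end{array}
\right\}
$$

and

$$
\max\left\{
\begin{array}{l}
p_{11.0} \\
p_{11.1} \\
-p_{00.0} - p_{01.0} + p_{00.1} + p_{11.1} \\
-p_{01.0} - p_{10.0} + p_{10.1} + p_{11.1}
\end{array}
\right\}
\;\le\; P(y_1^* \mid d_1^*) \;\le\;
\min\left\{
\begin{array}{l}
1 - p_{01.1} \\
1 - p_{01.0} \\
p_{00.0} + p_{11.0} + p_{10.1} + p_{11.1} \\
p_{10.0} + p_{11.0} + p_{00.1} + p_{11.1}
\end{array}
\right\}
$$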

These  bounds  improve  upon  the  results  of  [Man90].  In  addition,  one  can  prove 
that  these  are  the  tightest  possible  assumption-free  bounds. 


mm  < 


5.3.1 The positive-effects convention

To simplify the presentation of the bounds found in the last subsection, we first choose a notational system in which assignment to treatment does not reduce the probability of treatment usage ($D = d_1$) and of positive response ($Y = y_1$). From Eqs. (5.7) and (5.8), these conditions can be written as

$$
\mathrm{ACE}(Z \to D) \ge 0
$$
$$
\mathrm{ACE}(Z \to Y) \ge 0
$$

or, alternatively,

$$
p_{01.1} + p_{11.1} \ge p_{01.0} + p_{11.0}
$$
$$
p_{10.1} + p_{11.1} \ge p_{10.0} + p_{11.0}
$$
The conjunction of these two inequalities will be referred to as the condition of positive effects. This constraint may be imposed without loss of generality, because the labels of the variables' values can always be swapped in such a way that the inequalities are satisfied: if $\mathrm{ACE}(Z \to D) < 0$, we swap $d_0$ and $d_1$; if $\mathrm{ACE}(Z \to Y) < 0$, we swap $y_0$ and $y_1$.
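The relabeling argument can be sketched in a few lines of Python. This is only an illustration: the dictionary layout `p[(y, d, z)]` for $P(y,d \mid z)$, the function names, and the numerical point are assumptions introduced here, not part of the report.

```python
def ace_z_to_d(p):
    """ACE(Z -> D) = P(d1|z1) - P(d1|z0), with p[(y, d, z)] = P(y, d | z)."""
    return sum(p[(y, 1, 1)] - p[(y, 1, 0)] for y in (0, 1))


def ace_z_to_y(p):
    """ACE(Z -> Y) = P(y1|z1) - P(y1|z0)."""
    return sum(p[(1, d, 1)] - p[(1, d, 0)] for d in (0, 1))


def enforce_positive_effects(p):
    """Relabel d0/d1 and/or y0/y1 so that both intent-to-treat effects are >= 0."""
    if ace_z_to_d(p) < 0:   # swap the labels d0 and d1
        p = {(y, 1 - d, z): v for (y, d, z), v in p.items()}
    if ace_z_to_y(p) < 0:   # swap the labels y0 and y1
        p = {(1 - y, d, z): v for (y, d, z), v in p.items()}
    return p


# Hypothetical point (not from the report) that violates both inequalities.
p = {(0, 0, 0): 0.1, (0, 1, 0): 0.2, (1, 0, 0): 0.1, (1, 1, 0): 0.6,
     (0, 0, 1): 0.6, (0, 1, 1): 0.1, (1, 0, 1): 0.2, (1, 1, 1): 0.1}
q = enforce_positive_effects(p)
```

Swapping the labels of $D$ flips the sign of $\mathrm{ACE}(Z \to D)$ and leaves $\mathrm{ACE}(Z \to Y)$ unchanged, and vice versa, which is why the two swaps can be applied independently.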


In  a  notational  system  where  the  condition  of  positive  effects  holds,  the  lower 
and  upper  bounds  on  the  treatment  effect  can  be  simplified  to  read 


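$$
L_{D \to Y}(p) = \max \left\{ \begin{array}{l}
p_{11.1} + p_{00.0} - 1 \\
p_{11.1} - p_{11.0} - p_{10.0} - p_{01.1} - p_{10.1} \\
-p_{01.1} - p_{10.1} \\
-p_{01.0} - p_{10.0} \\
p_{00.0} - p_{01.0} - p_{10.0} - p_{01.1} - p_{00.1}
\end{array} \right\} \qquad (5.31)
$$

and

$$
U_{D \to Y}(p) = \min \left\{ \begin{array}{l}
1 - p_{01.1} - p_{10.0} \\
1 - p_{01.0} - p_{10.1} \\
-p_{01.0} + p_{01.1} + p_{00.1} + p_{11.0} + p_{00.0} \\
p_{11.1} + p_{00.1} \\
p_{11.0} + p_{00.0} \\
-p_{10.1} + p_{11.1} + p_{00.1} + p_{11.0} + p_{10.0}
\end{array} \right\} \qquad (5.32)
$$

respectively.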


5.3.2  Graphical  presentation  of  the  bounds 

When compliance is perfect (i.e., $\mathrm{ACE}(Z \to D) = 1$), we expect the causal effect of the treatment to coincide with the causal effect of the intent-to-treat, that is,

$$
\mathrm{ACE}(D \to Y) = \mathrm{ACE}(Z \to Y) \quad \text{if} \quad \mathrm{ACE}(Z \to D) = 1
$$

Similarly, if all units were to exhibit the same difference in compliance probabilities, $P(d_1 \mid z_1, u) - P(d_1 \mid z_0, u)$, the celebrated "Instrumental Variable" formula applies:

$$
\mathrm{ACE}(D \to Y) = \frac{\mathrm{ACE}(Z \to Y)}{\mathrm{ACE}(Z \to D)} = \frac{P(y_1 \mid z_1) - P(y_1 \mid z_0)}{P(d_1 \mid z_1) - P(d_1 \mid z_0)} \qquad (5.33)
$$

Here $\mathrm{ACE}(D \to Y)$ is determined solely by $\mathrm{ACE}(Z \to Y)$ and $\mathrm{ACE}(Z \to D)$. In general, however, the latter two parameters will not be sufficient to determine $\mathrm{ACE}(D \to Y)$ uniquely; nevertheless, they can be used to determine the range within which $\mathrm{ACE}(D \to Y)$ may fall.
Figure 5.3 plots $L_{D \to Y}(p)$ and $U_{D \to Y}(p)$ given $\mathrm{ACE}(Z \to D)$ and $\mathrm{ACE}(Z \to Y)$. The range of $\mathrm{ACE}(D \to Y)$ is quite wide, and is given by the simple formula:

$$
\mathrm{ACE}(Z \to Y) + \mathrm{ACE}(Z \to D) - 1 \;\le\; \mathrm{ACE}(D \to Y) \;\le\; 1 - |\mathrm{ACE}(Z \to D) - \mathrm{ACE}(Z \to Y)| \qquad (5.34)
$$

An interesting point is that plotting the natural bounds given by Eq. (5.12) as a function of $\mathrm{ACE}(Z \to D)$ and $\mathrm{ACE}(Z \to Y)$ gives us precisely the same results as shown in Figure 5.3.

Note that the bounds $L_{D \to Y}(p)$ and $U_{D \to Y}(p)$ for a particular point $p$ in $P$ space may be much tighter than the bounds shown in Figure 5.3 as functions of $\mathrm{ACE}(Z \to D)$ and $\mathrm{ACE}(Z \to Y)$ evaluated at $p$. This will be demonstrated by example in Section 5.4.


5.4  Examples 

At  this  point  it  is  worth  summarizing  by  example  how  the  bounds  of  Eqs.  (5.29) 
and  (5.30)  can  be  used  to  provide  meaningful  information  about  causal  effects. 

Consider the Lipid Research Clinics Coronary Primary Prevention Trial data (see [Pro84] for an extended description of the clinical trial). A portion of this data consisting of 337 subjects was analyzed in [EF91] using a model that incorporated subject compliance as an explanatory variable; this same data set is the focus of our analysis. A population of subjects was assembled and two preliminary cholesterol measurements were obtained: one prior to a suggested low-cholesterol diet (continuous variable $C_{I1}$); and one following the diet period ($C_{I2}$). The initial cholesterol level ($C_I$) was taken as a weighted average of these two measures: $C_I = 0.25\,C_{I1} + 0.75\,C_{I2}$. The subjects were randomized into two treatment groups; in the first group all subjects were prescribed cholestyramine ($z_1$), while the subjects in the other group were prescribed a placebo ($z_0$). During several years of treatment, each subject's cholesterol level was measured multiple times, and the average of these measurements was used as the post-treatment cholesterol level (continuous variable $C_P$). The compliance of each subject was determined by tracking the quantity of prescribed dosage consumed (continuous variable $B$).

In order to apply our analysis to this study, the continuous data obtained in the [Pro84] study must be transformed to binary variables representing treatment assignment ($Z$), received treatment ($D$), and treatment response ($Y$). The following transformation accomplishes this by thresholding dosage consumption and change in cholesterol level:

$$
d = \begin{cases} d_0 & \text{if } z = z_0 \text{ or } b < 50 \\ d_1 & \text{if } z = z_1 \text{ and } b \ge 50 \end{cases} \qquad (5.35)
$$

$$
y = \begin{cases} y_0 & \text{if } c_I - c_P < 28 \\ y_1 & \text{if } c_I - c_P \ge 28 \end{cases} \qquad (5.36)
$$

This transformation reflects the assumption that a subject does not receive cholestyramine if not assigned to the cholestyramine treatment group, namely, $P(y_0, d_1 \mid z_0) = 0$ and $P(y_1, d_1 \mid z_0) = 0$. The threshold for dosage consumption in Eq. (5.35) was selected as roughly the midpoint between minimum and maximum consumption, while the threshold for cholesterol level reduction in Eq. (5.36) was selected at 28 units.
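A minimal sketch of this thresholding step follows. The record layout `(z, b, c_i, c_p)`, the function names, and the four sample records are assumptions made for illustration; they are not the trial data. Treating a value exactly at a threshold as $d_1$ or $y_1$ is likewise an assumption about the boundary case.

```python
from collections import Counter


def binarize(z, b, c_i, c_p, dose_cut=50.0, chol_cut=28.0):
    """Map one subject's continuous record to binary (y, d), in the spirit of Eqs. (5.35)-(5.36)."""
    d = 1 if (z == 1 and b >= dose_cut) else 0   # no drug unless assigned to it
    y = 1 if (c_i - c_p) >= chol_cut else 0      # response = cholesterol drop of at least chol_cut
    return y, d


def conditional_frequencies(records):
    """Estimate p[(y, d, z)] = P(y, d | z) by sample frequencies within each arm.

    Both arms (z = 0 and z = 1) must be non-empty.
    """
    counts, arm_sizes = Counter(), Counter()
    for z, b, c_i, c_p in records:
        y, d = binarize(z, b, c_i, c_p)
        counts[(y, d, z)] += 1
        arm_sizes[z] += 1
    return {(y, d, z): counts[(y, d, z)] / arm_sizes[z]
            for y in (0, 1) for d in (0, 1) for z in (0, 1)}


# Hypothetical records: (assignment z, dosage consumed b, initial level, post-treatment level).
records = [(0, 80.0, 300.0, 250.0), (0, 10.0, 300.0, 290.0),
           (1, 90.0, 300.0, 250.0), (1, 20.0, 300.0, 295.0)]
p_hat = conditional_frequencies(records)
```

By construction, every placebo-arm subject is mapped to $d_0$, so the estimated $P(y, d_1 \mid z_0)$ is zero for both values of $y$.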

If the data samples are interpreted according to Eqs. (5.35) and (5.36), then the computed distribution over $(Z, D, Y)$ results in the following point in $P$ space*:

$$
\begin{array}{lclcl}
p_{00.0} &=& P(y_0, d_0 \mid z_0) &=& 0.919 \\
p_{01.0} &=& P(y_0, d_1 \mid z_0) &=& 0.000 \\
p_{10.0} &=& P(y_1, d_0 \mid z_0) &=& 0.081 \\
p_{11.0} &=& P(y_1, d_1 \mid z_0) &=& 0.000 \\
p_{00.1} &=& P(y_0, d_0 \mid z_1) &=& 0.315 \\
p_{01.1} &=& P(y_0, d_1 \mid z_1) &=& 0.139 \\
p_{10.1} &=& P(y_1, d_0 \mid z_1) &=& 0.073 \\
p_{11.1} &=& P(y_1, d_1 \mid z_1) &=& 0.473
\end{array}
$$

* We make the large-sample assumption and take the sample frequencies as representing $P(y, d \mid z)$.

By first computing the causal effects of the intent-to-treat,

$$
\begin{array}{l}
\mathrm{ACE}(Z \to D) = p_{11.1} + p_{01.1} - p_{11.0} - p_{01.0} = 0.612 \\
\mathrm{ACE}(Z \to Y) = p_{11.1} + p_{10.1} - p_{11.0} - p_{10.0} = 0.465
\end{array} \qquad (5.37)
$$

we can verify that the condition of positive effects is satisfied. This justifies the use of Eqs. (5.31) and (5.32) for evaluating the strict lower and upper bounds on $\mathrm{ACE}(D \to Y)$. By computing the quantities required for Eq. (5.31), we obtain

$$
L_{D \to Y}(p) = \max \left\{ \begin{array}{lcr}
p_{11.1} + p_{00.0} - 1 &=& 0.392 \\
p_{11.1} - p_{11.0} - p_{10.0} - p_{01.1} - p_{10.1} &=& 0.180 \\
-p_{01.1} - p_{10.1} &=& -0.212 \\
-p_{01.0} - p_{10.0} &=& -0.081 \\
p_{00.0} - p_{01.0} - p_{10.0} - p_{01.1} - p_{00.1} &=& 0.384
\end{array} \right\}
$$

Those needed for Eq. (5.32) give us

$$
U_{D \to Y}(p) = \min \left\{ \begin{array}{lcr}
1 - p_{01.1} - p_{10.0} &=& 0.780 \\
1 - p_{01.0} - p_{10.1} &=& 0.927 \\
-p_{01.0} + p_{01.1} + p_{00.1} + p_{11.0} + p_{00.0} &=& 1.373 \\
p_{11.1} + p_{00.1} &=& 0.788 \\
p_{11.0} + p_{00.0} &=& 0.919 \\
-p_{10.1} + p_{11.1} + p_{00.1} + p_{11.0} + p_{10.0} &=& 0.796
\end{array} \right\}
$$

Accordingly, we conclude that the treatment causal effect lies in the range

$$
0.392 \le \mathrm{ACE}(D \to Y) \le 0.780 \qquad (5.38)
$$

which is rather remarkable; the experimenter can categorically state that when applied uniformly to the population, the treatment is guaranteed to improve by at least 0.392 (39.2 percentage points) the probability of reducing the level of cholesterol by at least 28 points. This guarantee does not rest on any assumed model. Unfortunately, these results cannot be translated directly into a useful policy statement for treating people with high cholesterol, because the [Pro84] data were obtained for a continuous level of dosage consumed ($B$), while our analysis is restricted to binary $D$. To infer the behavior of the population under uniform consumption at a specific level of dosage, a model with a continuous (or at least 3-level) treatment must be studied; these types of models will be addressed in Chapter 6.
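The arithmetic behind Eq. (5.38) can be reproduced with a short script. This is a sketch only: the dictionary layout `p[(y, d, z)]` for $p_{yd.z}$ and the function names are assumptions introduced here, and the expressions are those of Eqs. (5.31) and (5.32), which presuppose the condition of positive effects.

```python
# Observed point from the Lipid example: p[(y, d, z)] = P(y, d | z).
p = {(0, 0, 0): 0.919, (0, 1, 0): 0.000, (1, 0, 0): 0.081, (1, 1, 0): 0.000,
     (0, 0, 1): 0.315, (0, 1, 1): 0.139, (1, 0, 1): 0.073, (1, 1, 1): 0.473}


def lower_bound(p):
    """Largest of the five expressions in Eq. (5.31)."""
    return max(
        p[1, 1, 1] + p[0, 0, 0] - 1,
        p[1, 1, 1] - p[1, 1, 0] - p[1, 0, 0] - p[0, 1, 1] - p[1, 0, 1],
        -p[0, 1, 1] - p[1, 0, 1],
        -p[0, 1, 0] - p[1, 0, 0],
        p[0, 0, 0] - p[0, 1, 0] - p[1, 0, 0] - p[0, 1, 1] - p[0, 0, 1],
    )


def upper_bound(p):
    """Smallest of the six expressions in Eq. (5.32)."""
    return min(
        1 - p[0, 1, 1] - p[1, 0, 0],
        1 - p[0, 1, 0] - p[1, 0, 1],
        -p[0, 1, 0] + p[0, 1, 1] + p[0, 0, 1] + p[1, 1, 0] + p[0, 0, 0],
        p[1, 1, 1] + p[0, 0, 1],
        p[1, 1, 0] + p[0, 0, 0],
        -p[1, 0, 1] + p[1, 1, 1] + p[0, 0, 1] + p[1, 1, 0] + p[1, 0, 0],
    )


low, high = lower_bound(p), upper_bound(p)
print(round(low, 3), round(high, 3))
```

Up to floating-point rounding, the two values are 0.392 and 0.780, the endpoints of Eq. (5.38).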

Note that the bounds in Eq. (5.38) are equal to the natural bounds given by Eq. (5.12):

$$
\begin{array}{l}
\mathrm{ACE}(D \to Y) \ge 0.465 - 0.073 - 0.000 = 0.392 \\
\mathrm{ACE}(D \to Y) \le 0.465 + 0.315 + 0.000 = 0.780
\end{array}
$$
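The agreement can be checked symbolically. Using only the normalization $\sum_{y,d} p_{yd.z} = 1$ for each $z$ and the expression for $\mathrm{ACE}(Z \to Y)$ in Eq. (5.37), the natural bounds reduce to the first entries of Eqs. (5.31) and (5.32):

$$
\begin{aligned}
\mathrm{ACE}(Z \to Y) - p_{10.1} - p_{01.0} &= p_{11.1} - (p_{11.0} + p_{10.0} + p_{01.0}) = p_{11.1} + p_{00.0} - 1, \\
\mathrm{ACE}(Z \to Y) + p_{00.1} + p_{11.0} &= (p_{11.1} + p_{10.1} + p_{00.1}) - p_{10.0} = 1 - p_{01.1} - p_{10.0}.
\end{aligned}
$$

For this data set those first entries happen to attain the maximum and the minimum, so the two sets of bounds coincide.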

It is interesting to note that "naive" comparison of subjects in and out of the treatment group would predict, in this case, the value of

$$
P(y_1 \mid d_1) - P(y_1 \mid d_0) = 0.662
$$

which demonstrates the potential inaccuracy in using the mean difference for evaluating $\mathrm{ACE}(D \to Y)$.

If $\mathrm{ACE}(Z \to D)$ and $\mathrm{ACE}(Z \to Y)$ are the only quantities measured, then the following bounds on $\mathrm{ACE}(D \to Y)$ can be computed by substituting the values from Eq. (5.37) into Eq. (5.34):

$$
0.077 \le \mathrm{ACE}(D \to Y) \le 0.853
$$

As noted in Section 5.3.2, these bounds are much wider than those obtained in Eq. (5.38), which utilized the full information given by $P(y, d \mid z)$.
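Both the wider range and the naive comparison can be reproduced numerically. The sketch below is illustrative: the function names and the `p[(y, d, z)]` layout are assumptions, and so is the equal assignment probability $P(z_1) = 0.5$ used for the naive difference, since the naive quantity depends on $P(z)$.

```python
def instrument_only_bounds(ace_zd, ace_zy):
    """Range of ACE(D -> Y) from Eq. (5.34), given only the two intent-to-treat effects."""
    return ace_zy + ace_zd - 1.0, 1.0 - abs(ace_zd - ace_zy)


def naive_difference(p, pz1=0.5):
    """P(y1|d1) - P(y1|d0), pooling both arms with assumed weight pz1 = P(z1)."""
    w = {0: 1.0 - pz1, 1: pz1}

    def p_y1_given(d):
        joint = sum(w[z] * p[1, d, z] for z in (0, 1))                   # P(y1, d)
        marg = sum(w[z] * (p[0, d, z] + p[1, d, z]) for z in (0, 1))     # P(d)
        return joint / marg

    return p_y1_given(1) - p_y1_given(0)


# Lipid example point: p[(y, d, z)] = P(y, d | z).
p = {(0, 0, 0): 0.919, (0, 1, 0): 0.000, (1, 0, 0): 0.081, (1, 1, 0): 0.000,
     (0, 0, 1): 0.315, (0, 1, 1): 0.139, (1, 0, 1): 0.073, (1, 1, 1): 0.473}

low, high = instrument_only_bounds(0.612, 0.465)
naive = naive_difference(p)
print(round(low, 3), round(high, 3), round(naive, 3))
```

Under the equal-arms assumption the naive difference rounds to 0.662, which lies inside both ranges and so cannot be ruled out by the data, but nothing in the bounds singles it out either.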

5.5  Tightness  of  the  natural  bound 

Although  the  example  above  shows  no  improvement  over  the  natural  bounds,  the 
next  (hypothetical)  example  will  show  that  in  certain  cases  the  natural  bounds 
can  be  improved  upon  significantly.  Consider  the  following  point  in  P  space: 


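$$
\begin{array}{lclcl}
p_{00.0} &=& P(y_0, d_0 \mid z_0) &=& 0.55 \\
p_{01.0} &=& P(y_0, d_1 \mid z_0) &=& 0.45 \\
p_{10.0} &=& P(y_1, d_0 \mid z_0) &=& 0.00 \\
p_{11.0} &=& P(y_1, d_1 \mid z_0) &=& 0.00 \\
p_{00.1} &=& P(y_0, d_0 \mid z_1) &=& 0.45 \\
p_{01.1} &=& P(y_0, d_1 \mid z_1) &=& 0.00 \\
p_{10.1} &=& P(y_1, d_0 \mid z_1) &=& 0.00 \\
p_{11.1} &=& P(y_1, d_1 \mid z_1) &=& 0.55
\end{array}
$$

Substitution of these parameters into Eq. (5.12) results in the natural bounds

$$
0.10 \le \mathrm{ACE}(D \to Y) \le 0.55
$$

while the bounds resulting from the application of Eqs. (5.29) and (5.30) collapse to

$$
0.55 \le \mathrm{ACE}(D \to Y) \le 0.55
$$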

Obviously,  when  our  goal  is  the  assessment  of  the  treatment  causal  effect,  the 
bounds  obtained  through  linear  programming  can  be  much  more  informative. 
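The collapse to a single value can be checked by evaluating the eight lower-bound and eight upper-bound expressions listed in Tables 5.1 and 5.2 at this point. The sketch below assumes a `p[(y, d, z)]` dictionary for $p_{yd.z}$; the function name is hypothetical.

```python
def ace_bounds(p):
    """Max of the eight Table 5.1 expressions and min of the eight Table 5.2 expressions."""
    lower = max(
        p[1, 1, 1] + p[0, 0, 0] - 1,
        p[1, 1, 0] + p[0, 0, 1] - 1,
        p[1, 1, 0] - p[1, 1, 1] - p[1, 0, 1] - p[0, 1, 0] - p[1, 0, 0],
        p[1, 1, 1] - p[1, 1, 0] - p[1, 0, 0] - p[0, 1, 1] - p[1, 0, 1],
        -p[0, 1, 1] - p[1, 0, 1],
        -p[0, 1, 0] - p[1, 0, 0],
        p[0, 0, 1] - p[0, 1, 1] - p[1, 0, 1] - p[0, 1, 0] - p[0, 0, 0],
        p[0, 0, 0] - p[0, 1, 0] - p[1, 0, 0] - p[0, 1, 1] - p[0, 0, 1],
    )
    upper = min(
        1 - p[0, 1, 1] - p[1, 0, 0],
        1 - p[0, 1, 0] - p[1, 0, 1],
        -p[0, 1, 0] + p[0, 1, 1] + p[0, 0, 1] + p[1, 1, 0] + p[0, 0, 0],
        -p[0, 1, 1] + p[1, 1, 1] + p[0, 0, 1] + p[0, 1, 0] + p[0, 0, 0],
        p[1, 1, 1] + p[0, 0, 1],
        p[1, 1, 0] + p[0, 0, 0],
        -p[1, 0, 1] + p[1, 1, 1] + p[0, 0, 1] + p[1, 1, 0] + p[1, 0, 0],
        -p[1, 0, 0] + p[1, 1, 0] + p[0, 0, 0] + p[1, 1, 1] + p[1, 0, 1],
    )
    return lower, upper


# Hypothetical point of Section 5.5: p[(y, d, z)] = P(y, d | z).
p = {(0, 0, 0): 0.55, (0, 1, 0): 0.45, (1, 0, 0): 0.00, (1, 1, 0): 0.00,
     (0, 0, 1): 0.45, (0, 1, 1): 0.00, (1, 0, 1): 0.00, (1, 1, 1): 0.55}
low, high = ace_bounds(p)
print(round(low, 2), round(high, 2))
```

The maximum is attained by $p_{11.1} - p_{11.0} - p_{10.0} - p_{01.1} - p_{10.1}$ and the minimum by $1 - p_{01.0} - p_{10.1}$, among others; both equal 0.55 up to rounding.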
Interestingly, a precise determination of $\mathrm{ACE}(D \to Y)$ is feasible even though the compliance is low:

$$
\mathrm{ACE}(Z \to D) = 0.10
$$

Intuitively, one would expect that if most subjects ignore their treatment assignment, the results of the study would be suspect. This intuition is partially supported by Figure 5.3, which shows that the feasible range of $\mathrm{ACE}(D \to Y)$ tends to widen as $\mathrm{ACE}(Z \to D)$ decreases. Nevertheless, the idiosyncratic features of the data in this example permit us to determine precisely the causal effect. These features also allow us to precisely determine the distribution of subjects in the population, in terms of the subjects' compliance and response characteristics.

Only two behaviors are consistent with these data. The first behavior is characterized by perfect compliance with the assignment along with a perfect response pattern to the treatment received ($y = y_1$ if and only if $d = d_1$). The second behavior is characterized by perfect defiance of the assignment (the subject always chooses the treatment that is the opposite of the one assigned) along with a total inability to respond positively to either treatment. The strong and strange interactions between the compliance and response behaviors implied by these data would be very uncharacteristic of most subject populations.

In  fact,  we  can  prove  that  there  are  exactly  six  regions  where  the  average 
causal  effect  of  treatment  on  response  is  identifiable  when  no  assumptions  are 
presumed.  This  is  accomplished  by  enumerating  the  conditions  whereby  one  of 
the  lower  bound  terms  in  Eq.  (5.29)  is  equal  to  one  of  the  upper  bound  terms  in 
Eq.  (5.30). 
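$$
\begin{array}{l|l}
\text{Region} & \mathrm{ACE}(D \to Y) \\ \hline
P(d_1 \mid z_0) = 0,\;\; P(d_0 \mid z_1) = 0 & P(y_1, d_1 \mid z_1) + P(y_0, d_0 \mid z_0) - 1 \\ \hline
P(y_1, d_1 \mid z_0) = 0,\;\; P(y_0, d_1 \mid z_1) = 0,\;\; P(y_0, d_1 \mid z_0) + P(y_1, d_1 \mid z_1) = 1 & P(y_0, d_0 \mid z_0) + P(y_0, d_0 \mid z_1) - P(y_0, d_1 \mid z_0) \\ \hline
P(y_1, d_0 \mid z_0) = 0,\;\; P(y_0, d_0 \mid z_1) = 0,\;\; P(y_0, d_0 \mid z_0) + P(y_1, d_0 \mid z_1) = 1 & P(y_1, d_1 \mid z_1) + P(y_1, d_1 \mid z_0) - P(y_1, d_0 \mid z_1) \\ \hline
P(d_0 \mid z_0) = 0,\;\; P(d_1 \mid z_1) = 0 & P(y_1, d_1 \mid z_0) + P(y_0, d_0 \mid z_1) - 1 \\ \hline
P(y_0, d_1 \mid z_0) = 0,\;\; P(y_1, d_1 \mid z_1) = 0,\;\; P(y_1, d_1 \mid z_0) + P(y_0, d_1 \mid z_1) = 1 & P(y_0, d_0 \mid z_0) + P(y_0, d_0 \mid z_1) - P(y_0, d_1 \mid z_1) \\ \hline
P(y_0, d_0 \mid z_0) = 0,\;\; P(y_1, d_0 \mid z_1) = 0,\;\; P(y_1, d_0 \mid z_0) + P(y_0, d_0 \mid z_1) = 1 & P(y_1, d_1 \mid z_1) + P(y_1, d_1 \mid z_0) - P(y_1, d_0 \mid z_0)
\end{array}
$$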

The entries in this table indicate that precise determination of treatment effects is feasible whenever (a) the percentage of subjects complying with assignment $z_0$ is the same as those complying with $z_1$ and (b) in at least one treatment arm $d$, $y$ and $z$ are perfectly correlated.

In  this  section,  we  have  shown  that,  in  general,  the  natural  bounds  given  by 
Eq.  (5.12)  may  not  always  be  as  tight  as  the  bounds  given  by  Eqs.  (5.29)  and 
(5.30).  In  the  next  section,  however,  we  will  demonstrate  that  the  natural  bounds 
are tight in two important subspaces of $P$: when the data reveal treatment sufficiency
(conditional independence between treatment assignment and treatment
response  given  treatment  received),  and  when  it  is  reasonable  to  assume  that 
subjects  are  non-defiant. 


5.6  Incorporating  additional  assumptions 

In  this  section  we  will  examine  the  impact  that  various  assumptions  have  on  the 
bounds  for  ACE(D  — >  Y)  and  the  constraints  that  they  place  on  the  observed 
parameters.  The  main  assumptions  to  be  discussed  here  are: 

•  treatment  sufficiency  (conditional  independence  of  treatment  assignment 
and  observed  response  given  treatment  received); 
•  treatment  sufficiency  together  with  structural  stability; 

•  no  perfectly  defiant  subjects;  and 

•  monotonic  compliance  and  response  behaviors. 

5.6.1  Treatment  sufficiency 

This subsection examines whether the presence of conditional independence $Z \perp\!\!\!\perp Y \mid D$ in the data simplifies the formulas for the bounds on $\mathrm{ACE}(D \to Y)$. In other words, are any of the expressions within the minimization/maximization of Eqs. (5.31) and (5.32) eliminated? The following theorem provides the answer to this question.

Theorem 5.6.1 If the observed distribution $P(y, d \mid z)$ satisfies $Z \perp\!\!\!\perp Y \mid D$ and the condition of positive effects, then the natural bounds on $\mathrm{ACE}(D \to Y)$

$$
\begin{array}{l}
\mathrm{ACE}(D \to Y) \ge \mathrm{ACE}(Z \to Y) - P(y_1, d_0 \mid z_1) - P(y_0, d_1 \mid z_0) \\
\mathrm{ACE}(D \to Y) \le \mathrm{ACE}(Z \to Y) + P(y_0, d_0 \mid z_1) + P(y_1, d_1 \mid z_0)
\end{array}
$$

are tight.


Proof:

We will show that a set of constraints implied by $Z \perp\!\!\!\perp Y \mid D$ and the condition of positive effects are only mutually consistent with those conditions in Tables 5.1 and 5.2 corresponding to the natural bounds (the topmost entries).

First, assume that $p$ is strictly positive.

By definition, $Z \perp\!\!\!\perp Y \mid D$ if and only if

$$
P(y \mid d, z_0) = P(y \mid d, z_1)
$$

for all $y$ and $d$ such that $P(d \mid z_0) > 0$ and $P(d \mid z_1) > 0$. This may be written:

$$
\frac{p_{10.0}}{p_{00.0} + p_{10.0}} = \frac{p_{10.1}}{p_{00.1} + p_{10.1}}, \qquad
\frac{p_{11.0}}{p_{01.0} + p_{11.0}} = \frac{p_{11.1}}{p_{01.1} + p_{11.1}}
$$

or, equivalently,

$$
\begin{array}{l}
p_{00.1} = S\,p_{00.0} \\
p_{10.1} = S\,p_{10.0} \\
p_{01.0} = T\,p_{01.1} \\
p_{11.0} = T\,p_{11.1}
\end{array} \qquad (5.39)
$$

where $S$ and $T$ represent the ratios

$$
S = \frac{p_{00.1}}{p_{00.0}} = \frac{p_{10.1}}{p_{10.0}}, \qquad
T = \frac{p_{01.0}}{p_{01.1}} = \frac{p_{11.0}}{p_{11.1}}
$$

From the condition of positive effects,

$$
p_{11.1} + p_{01.1} - p_{11.0} - p_{01.0} \ge 0
$$

which, from Eq. (5.39), may be rewritten

$$
(1 - T)(p_{11.1} + p_{01.1}) \ge 0 \qquad (5.40)
$$

This implies that $T \le 1$.

Likewise, we may use the equalities in Eq. (5.39) to rewrite the probabilistic constraints given by Eqs. (5.19) and (5.20):

$$
\begin{array}{l}
p_{00.0} + T\,p_{01.1} + p_{10.0} + T\,p_{11.1} = 1 \\
S\,p_{00.0} + p_{01.1} + S\,p_{10.0} + p_{11.1} = 1
\end{array}
$$

Taking the difference of these two equations gives

$$
(1 - S)(p_{00.0} + p_{10.0}) = (1 - T)(p_{01.1} + p_{11.1})
$$

$T \le 1$ then implies that $S \le 1$.

Applying these bounds on $S$ and $T$ to Eq. (5.39) results in the constraints

$$
p_{00.0} \ge p_{00.1}, \qquad p_{10.0} \ge p_{10.1}, \qquad p_{01.1} \ge p_{01.0}, \qquad p_{11.1} \ge p_{11.0}
$$

which, when conjoined with the conditions in Tables 5.1 and 5.2, reveal that the only applicable bounds on $\mathrm{ACE}(D \to Y)$ under the assumption of positive effects and conditional independence are the natural bounds:

$$
\begin{array}{l}
L_{D \to Y}(p) = p_{11.1} + p_{00.0} - 1 = \mathrm{ACE}(Z \to Y) - P(y_1, d_0 \mid z_1) - P(y_0, d_1 \mid z_0) \\
U_{D \to Y}(p) = 1 - p_{01.1} - p_{10.0} = \mathrm{ACE}(Z \to Y) + P(y_0, d_0 \mid z_1) + P(y_1, d_1 \mid z_0)
\end{array}
$$

When $p$ is not strictly positive, we can proceed through a similar exercise on a case-by-case basis and obtain identical results. We omit this part of the proof.

□


Figure 5.4 shows how the conditional independence tightens the lower bounds shown in Figure 5.3 when the only information known about the observed distribution is $\mathrm{ACE}(Z \to D)$ and $\mathrm{ACE}(Z \to Y)$.

5.6.2  Treatment  sufficiency  with  structural  stability 

Where treatment sufficiency holds under a variety of experimental conditions, it is reasonable to assume that it is not caused by incidental equality of parameters, but rather by structural constraints. This notion of structural stability is indeed the pivotal assumption behind the causal inference methods of [PV91, SGS91], namely, that every conditional independence shown in the data must be logically implied by the decomposition of the joint probability distribution given by Eq. (5.2) as dictated by the graph structure. If this assumption holds, then the data are DAG-isomorphic to the graph structure, and all independence relations may then be tested by using the d-separation criterion ([Pea88]).

Theorem 5.6.2 If an observed distribution $P(y, d \mid z)$ is structurally stable and satisfies $Y \perp\!\!\!\perp Z \mid D$ and $Y \not\perp\!\!\!\perp Z$, then

$$
\mathrm{ACE}(D \to Y) = P(y_1 \mid d_1) - P(y_1 \mid d_0) \qquad (5.41)
$$


Proof:

The antecedent of the theorem implies that $Z$ and $Y$ must be d-separated given $D$ in the graph structure for which the data is DAG-isomorphic. Applying the d-separation criterion to the graphical structure of Figure 5.1, we find that, given $D$, $Z$ and $Y$ are dependent via the path $Z - D - U - Y$. The only way to remove this dependency is to eliminate one of the following edges: $Z \to D$, $U \to D$, or $U \to Y$. The assumption that $Z$ and $D$ are marginally dependent prevents the elimination of $Z \to D$; therefore, the antecedent of the theorem can only be satisfied if at least one of the edges $U \to D$ or $U \to Y$ is eliminated.

First, assume that $U \to Y$ is eliminated from the graph structure. In this case, $P(y \mid d, u) = P(y \mid d)$, which, when substituted into Eq. (5.5), results in

$$
\mathrm{ACE}(D \to Y) = P(y_1 \mid d_1) - P(y_1 \mid d_0)
$$

Next, assume that $U \to D$ is eliminated from the graph structure. In this case, we note that $P(u) = P(u \mid d)$, allowing the following transformations of Eq. (5.5):

$$
\begin{aligned}
\mathrm{ACE}(D \to Y) &= \sum_u \left[ P(u) P(y_1 \mid d_1, u) - P(u) P(y_1 \mid d_0, u) \right] \\
&= \sum_u \left[ P(u \mid d_1) P(y_1 \mid d_1, u) - P(u \mid d_0) P(y_1 \mid d_0, u) \right] \\
&= \sum_u \left[ P(y_1, u \mid d_1) - P(y_1, u \mid d_0) \right] \\
&= P(y_1 \mid d_1) - P(y_1 \mid d_0)
\end{aligned}
$$

□


Notice that the combination of structural stability and treatment sufficiency subsumes the assumption of Eq. (5.1); $Z \perp\!\!\!\perp Y \mid \{D, U\}$ is no longer an assumption but is implied by $Z \perp\!\!\!\perp Y \mid D$, because, for any set of variables $S$, $Z \perp\!\!\!\perp Y \mid S$ cannot hold if there is a direct arc from $Z$ to $Y$. Therefore, when structural stability holds, finding a variable $Z'$ satisfying $Z' \perp\!\!\!\perp Y \mid D$ and $Z' \not\perp\!\!\!\perp Y$ permits us to dispose of the randomized assignment altogether and infer causal effects (using Eq. (5.41)) in purely observational studies. Discovering a $Z'$ which satisfies these relationships may be viewed as uncovering a randomized experiment that is conducted by Nature itself, and this is the basis of the "virtual control" condition discussed in [PV91].
5.6.3 Non-defiance

A subject is characterized as perfectly defiant if under either treatment assignment the subject fails to comply with the assignment ($d = d_1$ if and only if $z = z_0$). In terms of the potential-response model of Figure 5.2, this behavior is specified by $r_d = 2$ in Eq. (5.14). One could imagine individuals who despise having decisions made for them. It is possible that the act of assigning them to a treatment will lead them to evade that treatment, where alternatively, they would have voluntarily selected that treatment. Consider a study that involves observation of draft status ($Z$) and military service ($D$) ([AIR93]). It is conceivable that there could be subjects who despise authority and so, if drafted, would evade service and, if not drafted, would volunteer for service.

Alternatively,  there  are  situations  in  which  perfectly  defiant  behavior  would 
be  improbable: 

• when subjects do not know exactly what the two treatment options ($z_0$ and $z_1$) are; hence, it is beyond their means to defy both treatment assignments.

• when subjects know what the two treatment options are, but do not know which treatment they have been assigned (the procedures for receiving the assigned treatments are identical, as in the use of placebo).

• when subjects know what both treatments are and know which treatment they have been assigned but do not have access to both treatments; therefore, it is beyond their means to obtain the opposite treatment under either assignment.

Drug studies are very likely to fit one of these situations, especially since a placebo is usually used as the alternative treatment to the medication under study, so subjects cannot easily determine which treatment they have been assigned.

Based on the applicability suggested above, we will define the assumption of non-defiance as stating that there are no perfectly defiant subjects in a study. This assumption is expressed by the constraint $P(r_d = 2) = 0$, or $q_{2j} = 0$ for $j = 0, \ldots, 3$. Non-defiance together with the condition of positive effects is equivalent to the assumption of "monotonicity" analyzed by [AIR93], which translates to the restriction: either $P(r_d = 2) = 0$ or $P(r_d = 1) = 0$. Because the assumption of non-defiance imposes restrictions on the unobserved parameters in $Q$ space, it carries the potential of improving the bounds on $\mathrm{ACE}(D \to Y)$ beyond those of Eqs. (5.27) and (5.28). The following theorem refutes this possibility.

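Theorem 5.6.3 If all subjects in a population are non-defiant, then the natural bounds on $\mathrm{ACE}(D \to Y)$,

$$
\begin{array}{l}
\mathrm{ACE}(D \to Y) \ge \mathrm{ACE}(Z \to Y) - P(y_1, d_0 \mid z_1) - P(y_0, d_1 \mid z_0) \\
\mathrm{ACE}(D \to Y) \le \mathrm{ACE}(Z \to Y) + P(y_0, d_0 \mid z_1) + P(y_1, d_1 \mid z_0)
\end{array}
$$

are tight.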

This  theorem  may  be  proven  by  reapplying  the  linear  optimization  procedure 
detailed  in  Appendix  B  to  the  optimization  problem  given  by  Eq.  (5.26)  with  the 
additional  constraints  q2j  =  0  for  j  =  0, . . . ,  3.  This  procedure  results  in  a  single 
expression  each  for  the  lower  and  upper  bounds,  identical  to  the  natural  bounds 
given  by  Eq.  (5.12). 

It is important to understand that the non-defiance assumption (as well as that of treatment sufficiency) does not widen the bounds of Eqs. (5.27) and (5.28) to the natural bounds, but instead restricts the observation space $P$ to a region where the natural bounds are the only applicable bounds. Consequently, the assumption of non-defiance is partly observable; if $P(y, d \mid z)$ does not satisfy the following constraints implied by non-defiance

$$
p_{00.0} \ge p_{00.1}, \qquad p_{01.1} \ge p_{01.0}, \qquad p_{10.0} \ge p_{10.1}, \qquad p_{11.1} \ge p_{11.0}
$$

then the assumption of non-defiance does not hold. To summarize, the assumption of non-defiance provides no benefits over the unconditional bounds given by Eqs. (5.29) and (5.30); however, it narrows the space of observation so as to render the natural bounds of Eq. (5.12) realizable.

5.6.4  Monotonic  compliance  and  response  behaviors 

What if we assume that both compliance and treatment response behaviors specify monotonic functions from treatment assignment to treatment consumed and treatment consumed to treatment response, respectively? This just corresponds to the incorporation of two additional constraints:

$$
P(r_d = 2) = 0, \qquad P(r_y = 2) = 0 \qquad (5.42)
$$

The set of points in $P$ space consistent with these assumptions is given by the following set of constraints:

$$
\begin{array}{rcl}
p_{10.1} &\le& p_{10.0} \\
p_{01.0} &\le& p_{01.1} \\
p_{10.0} + p_{11.0} &\le& p_{10.1} + p_{11.1} \\
p_{01.0} + p_{11.0} &\le& p_{01.1} + p_{11.1}
\end{array}
$$

These last two inequalities just correspond to the positive effects convention ($\mathrm{ACE}(Z \to Y) \ge 0$ and $\mathrm{ACE}(Z \to D) \ge 0$).

In terms of the $Q$ space parameters, the average treatment effect under the monotonicity assumption reduces to

$$
\mathrm{ACE}(D \to Y) = q_{01} + q_{11} + q_{31} \qquad (5.43)
$$

If we incorporate the constraints given by Eq. (5.42) and optimize the objective function (Eq. (5.43)), we generate the following bounds on the average causal effect under the monotonicity assumption:

$$
p_{10.1} + p_{11.1} - p_{10.0} - p_{11.0} \le \mathrm{ACE}(D \to Y) \le 1 - p_{01.1} - p_{10.0}
$$

or

$$
\mathrm{ACE}(Z \to Y) \le \mathrm{ACE}(D \to Y) \le \mathrm{ACE}(Z \to Y) + p_{11.0} + p_{00.1}
$$

This shows that the average treatment effect evaluated under the monotonicity assumption is always at least as great as the causal effect evaluated from the intent-to-treat analysis.
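As an illustration, the sketch below first checks the four observable constraints and then evaluates the monotonicity bounds. The function name, the tolerance argument, and the `p[(y, d, z)]` layout are assumptions introduced here; applying it to the Lipid point of Section 5.4 is purely illustrative and presumes that monotonicity is a reasonable assumption for that study.

```python
def monotonicity_bounds(p, tol=1e-12):
    """Bounds on ACE(D -> Y) when both compliance and response are monotonic.

    Raises ValueError if p violates the constraints implied by Eq. (5.42).
    """
    ace_zy = p[1, 1, 1] + p[1, 0, 1] - p[1, 1, 0] - p[1, 0, 0]
    ace_zd = p[1, 1, 1] + p[0, 1, 1] - p[1, 1, 0] - p[0, 1, 0]
    consistent = (p[1, 0, 1] <= p[1, 0, 0] + tol and
                  p[0, 1, 0] <= p[0, 1, 1] + tol and
                  ace_zy >= -tol and ace_zd >= -tol)
    if not consistent:
        raise ValueError("p is inconsistent with monotonic compliance and response")
    return ace_zy, 1 - p[0, 1, 1] - p[1, 0, 0]


# Lipid example point: p[(y, d, z)] = P(y, d | z).
p = {(0, 0, 0): 0.919, (0, 1, 0): 0.000, (1, 0, 0): 0.081, (1, 1, 0): 0.000,
     (0, 0, 1): 0.315, (0, 1, 1): 0.139, (1, 0, 1): 0.073, (1, 1, 1): 0.473}
low, high = monotonicity_bounds(p)
print(round(low, 3), round(high, 3))
```

For this point the lower bound rises from 0.392 to the intent-to-treat value 0.465, while the upper bound stays at 0.780.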


5.7  Additional  Results 

5.7.1 Local average-treatment effect

While this chapter focuses primarily on predicting the average treatment effect over an entire population, there are cases where one would be interested in treatment effects averaged over a subpopulation of special characteristics. [AIR93] have found that, under the assumption of non-defiance, the treatment effect averaged over the subpopulation of perfectly complying individuals, $\mathrm{ACE}_c(D \to Y)$, can be identified and is given by the Instrumental Variable formula

$$
\mathrm{ACE}_c(D \to Y) = \frac{\mathrm{ACE}(Z \to Y)}{\mathrm{ACE}(Z \to D)} = \frac{P(y_1 \mid z_1) - P(y_1 \mid z_0)}{P(d_1 \mid z_1) - P(d_1 \mid z_0)} \qquad (5.44)
$$

In other words, Eq. (5.44) gives the correct treatment effect for those individuals whose participation in the treatment $D$ comes as a consequence of the encouragement $Z$.
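Evaluating Eq. (5.44) takes only the two intent-to-treat effects. The sketch below uses a hypothetical function name and the `p[(y, d, z)]` layout as assumptions; plugging in the Lipid point of Section 5.4 is an illustration that presumes non-defiance holds for that study.

```python
def complier_average_effect(p):
    """Instrumental Variable ratio ACE(Z -> Y) / ACE(Z -> D) of Eq. (5.44)."""
    ace_zy = p[1, 1, 1] + p[1, 0, 1] - p[1, 1, 0] - p[1, 0, 0]
    ace_zd = p[1, 1, 1] + p[0, 1, 1] - p[1, 1, 0] - p[0, 1, 0]
    if ace_zd == 0:
        raise ZeroDivisionError("assignment has no effect on treatment received")
    return ace_zy / ace_zd


# Lipid example point: p[(y, d, z)] = P(y, d | z).
p = {(0, 0, 0): 0.919, (0, 1, 0): 0.000, (1, 0, 0): 0.081, (1, 1, 0): 0.000,
     (0, 0, 1): 0.315, (0, 1, 1): 0.139, (1, 0, 1): 0.073, (1, 1, 1): 0.473}
effect = complier_average_effect(p)
print(round(effect, 3))
```

The ratio is 0.465 / 0.612, roughly 0.76. It describes the complying subpopulation only, so it need not coincide with the population-wide $\mathrm{ACE}(D \to Y)$ bounded in Eq. (5.38).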

This can be verified by noting that a compliant subpopulation is characterized by the condition $r_d = 1$; thus

$$
\begin{aligned}
\mathrm{ACE}_c(D \to Y) &= P(y_1 \mid \hat{d}_1, r_d = 1) - P(y_1 \mid \hat{d}_0, r_d = 1) \\
&= P(r_y = 1 \mid r_d = 1) - P(r_y = 2 \mid r_d = 1) \\
&= \frac{P(r_d = 1, r_y = 1) - P(r_d = 1, r_y = 2)}{P(r_d = 1)} \\
&= \frac{q_{11} - q_{12}}{q_{10} + q_{11} + q_{12} + q_{13}}
\end{aligned}
$$

This last expression coincides with the Instrumental Variable formula above under the condition of non-defiance, namely, $P(r_d = 2) = 0$, or $q_{2j} = 0$ for $j = 0, \ldots, 3$.

It is worth noting that the subpopulation of perfectly complying individuals is not, in general, identifiable, because the condition $r_d = 1$ cannot be determined from the triplet $(y, d, z)$. Nevertheless, the behavior of this subpopulation may be of interest to analysts, as it reveals the treatment effect under ideal conditions, free of noncompliance side effects. Bounds on the behavior of other subpopulations of interest can be obtained by methods similar to those in Section 5.2.2.


5.7.2  Treatment  effect  given  treatment  consumed 

Some researchers might claim that the average treatment effect ($\mathrm{ACE}(D \to Y)$) is not the deciding factor when developing a policy for patient care, because they believe that patients' compliance behaviors in clinical practice will be similar to their compliance during drug trials. Although the author disagrees with this interpretation (patients would be more apt to follow the advice of their physician in taking an approved drug demonstrated to be effective), this section will examine the conditional treatment effect that would be the basis for their policy decision.


If patients act in clinical practice as they do in drug studies, then the researcher is actually interested in the average causal effect of drug treatment for those subjects who actually consumed the drug ($d_1$). In other words, we wish to derive bounds for $P(y_1^* \mid \hat{d}_1^*, d_1) - P(y_1^* \mid \hat{d}_0^*, d_1)$ given the observed distribution $P(z, d, y)$.

These counterfactual probabilities may be written in terms of the distribution of response functions:

$$
P(y_1^* \mid \hat{d}_0^*, d_1) = \frac{P(z_1)\,[q_{12} + q_{13} + q_{32} + q_{33}] + P(z_0)\,[q_{22} + q_{23} + q_{32} + q_{33}]}{P(d_1)}
$$

$$
P(y_1^* \mid \hat{d}_1^*, d_1) = \frac{P(z_1)\,[q_{11} + q_{13} + q_{31} + q_{33}] + P(z_0)\,[q_{21} + q_{23} + q_{31} + q_{33}]}{P(d_1)}
$$

Taking the difference between these two expressions gives us the average causal effect of treatment on response for those individuals who took the treatment:

$$
\mathrm{ACE}(D \to Y \mid d_1) = \frac{P(z_1)\,[q_{11} + q_{31} - q_{12} - q_{32}] + P(z_0)\,[q_{21} + q_{31} - q_{22} - q_{32}]}{P(d_1)} \qquad (5.45)
$$

Given a specific distribution for the treatment assignment $P(z)$, we can apply linear symbolic optimization to the numerator of this equation. It turns out that the $Q$ space expressions multiplied by $P(z_1)$ and $P(z_0)$ may be optimized independently. This can be shown by deriving the closed-form bounds on $f_1(q) = q_{11} + q_{31} - q_{12} - q_{32}$ and $f_2(q) = q_{21} + q_{31} - q_{22} - q_{32}$ and demonstrating that the sum of their lower (upper) bounds is equal to the closed-form lower (upper) bound on $f_1(q) + f_2(q)$. Since the coefficients on $f_1$ and $f_2$ ($P(z_1)$ and $P(z_0)$, respectively) are non-negative, we can optimize Eq. (5.45) by optimizing $f_1$ and $f_2$ independently.

Following this strategy, closed-form bounds on the conditional average causal effect may be derived, and are expressed in terms of the distribution of observables $P(y,d,z)$:

$$\frac{1}{P(d_1)} \max \left\{ \begin{array}{l}
P(z_0)\,[p_{10.0} + p_{11.0} - p_{10.1} - p_{11.1}] - p_{01.1} \\
P(z_1)\,[p_{10.1} + p_{11.1} - p_{10.0} - p_{11.0}] - p_{01.0} \\
P(z_0)\,[p_{11.0} - p_{10.1} - p_{11.1}] - P(z_1)\,p_{10.0} - p_{01.0} \\
P(z_1)\,[p_{11.1} - p_{10.0} - p_{11.0}] - P(z_0)\,p_{10.1} - p_{01.1}
\end{array} \right\}
\le \mathrm{ACE}(D \to Y \mid d_1) \le
\frac{1}{P(d_1)} \min \left\{ \begin{array}{l}
P(z_0)\,p_{11.0} + P(z_1)\,[p_{10.1} + p_{11.1} - p_{10.0}] \\
P(z_0)\,[p_{10.0} + p_{11.0} - p_{10.1}] + P(z_1)\,p_{11.1} \\
P(z_1)\,[p_{00.0} + p_{01.0} - p_{01.1}] + P(z_0)\,p_{00.1} + p_{11.1} \\
P(z_1)\,p_{00.0} + p_{11.0} + P(z_0)\,[p_{00.1} + p_{01.1} - p_{01.0}]
\end{array} \right\}$$

or

$$\frac{1}{P(d_1)} \max \left\{ \begin{array}{l}
P(y_1,z_0) - P(y_1|z_1)P(z_0) - P(y_0,d_1|z_1) \\
P(y_1,z_1) - P(y_1|z_0)P(z_1) - P(y_0,d_1|z_0) \\
P(y_1,z_0) - P(y_1|z_1)P(z_0) - P(y_1,d_0|z_0) - P(y_0,d_1|z_0) \\
P(y_1,z_1) - P(y_1|z_0)P(z_1) - P(y_1,d_0|z_1) - P(y_0,d_1|z_1)
\end{array} \right\}
\le \mathrm{ACE}(D \to Y \mid d_1) \le
\frac{1}{P(d_1)} \min \left\{ \begin{array}{l}
P(y_1,z_1) - P(y_1|z_0)P(z_1) + P(y_1,d_1|z_0) \\
P(y_1,z_0) - P(y_1|z_1)P(z_0) + P(y_1,d_1|z_1) \\
P(y_1,z_1) - P(y_1|z_0)P(z_1) + P(y_0,d_0|z_1) + P(y_1,d_1|z_1) \\
P(y_1,z_0) - P(y_1|z_1)P(z_0) + P(y_0,d_0|z_0) + P(y_1,d_1|z_0)
\end{array} \right\}$$
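As a rough numerical illustration of the closed-form bounds above, the following Python sketch evaluates the four lower and four upper expressions in their $p_{yd.z}$ form. The function name `treated_ace_bounds`, the dictionary layout `p[(y, d, z)] = P(y, d | z)`, and the input values are assumptions made for this sketch only: the distribution is the P-space point discussed in Section 5.7.3, and the balanced assignment $P(z_1) = 0.5$ is hypothetical, since the text does not fix $P(z)$ for that point.

```python
def treated_ace_bounds(p, pz1):
    """Bounds on ACE(D -> Y | d1).

    p[(y, d, z)] holds P(y, d | z) for binary y, d, z; pz1 is P(z1).
    """
    pz0 = 1.0 - pz1

    def g(y, d, z):
        return p[(y, d, z)]

    # P(d1), marginalised over the treatment assignment.
    pd1 = pz0 * (g(0, 1, 0) + g(1, 1, 0)) + pz1 * (g(0, 1, 1) + g(1, 1, 1))

    lower_terms = [
        pz0 * (g(1, 0, 0) + g(1, 1, 0) - g(1, 0, 1) - g(1, 1, 1)) - g(0, 1, 1),
        pz1 * (g(1, 0, 1) + g(1, 1, 1) - g(1, 0, 0) - g(1, 1, 0)) - g(0, 1, 0),
        pz0 * (g(1, 1, 0) - g(1, 0, 1) - g(1, 1, 1)) - pz1 * g(1, 0, 0) - g(0, 1, 0),
        pz1 * (g(1, 1, 1) - g(1, 0, 0) - g(1, 1, 0)) - pz0 * g(1, 0, 1) - g(0, 1, 1),
    ]
    upper_terms = [
        pz0 * g(1, 1, 0) + pz1 * (g(1, 0, 1) + g(1, 1, 1) - g(1, 0, 0)),
        pz0 * (g(1, 0, 0) + g(1, 1, 0) - g(1, 0, 1)) + pz1 * g(1, 1, 1),
        pz1 * (g(0, 0, 0) + g(0, 1, 0) - g(0, 1, 1)) + pz0 * g(0, 0, 1) + g(1, 1, 1),
        pz1 * g(0, 0, 0) + g(1, 1, 0) + pz0 * (g(0, 0, 1) + g(0, 1, 1) - g(0, 1, 0)),
    ]
    return max(lower_terms) / pd1, min(upper_terms) / pd1


# Illustrative input: the P-space point of Section 5.7.3, with an ASSUMED
# balanced assignment P(z1) = 0.5 (not taken from the text).
p = {
    (0, 0, 0): 0.84, (0, 1, 0): 0.16, (1, 0, 0): 0.00, (1, 1, 0): 0.00,
    (0, 0, 1): 0.08, (0, 1, 1): 0.00, (1, 0, 1): 0.12, (1, 1, 1): 0.80,
}
lower, upper = treated_ace_bounds(p, pz1=0.5)
print(f"{lower:.4f} <= ACE(D -> Y | d1) <= {upper:.4f}")
```

For this particular input the two bounds coincide (both equal 0.34/0.48, about 0.708), so the effect among the treated happens to be pinned down exactly; other inputs will in general leave a gap between the bounds.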


5.7.3 Divergence of intent-to-treat analysis from treatment effect bounds

Strides have been made to educate the scientific community about the potential errors in evaluating treatment effects from the intent-to-treat analysis, ACE(Z → Y) = $P(y_1|z_1) - P(y_1|z_0)$; however, there are still some who incorrectly use this expression to evaluate a drug's efficacy in a quasi-experimental study. This raises the question: just how inaccurate is the intent-to-treat analysis? Even though this analysis does not produce the correct bounds for the treatment effect, does the computed value at least provide a feasible value for the treatment effect? Unfortunately, not always. In fact, this divergence from the bounds can occur even in cases where ACE(Z → D) approaches 100%. For example, consider the point in P space:


p_{00.0} = 0.84      p_{00.1} = 0.08
p_{01.0} = 0.16      p_{01.1} = 0.00
p_{10.0} = 0.00      p_{10.1} = 0.12
p_{11.0} = 0.00      p_{11.1} = 0.80

The  compliance  is  given  by  the  average  causal  effect  of  treatment  assignment  on 
treatment  received 

ACE(Z → D) = 0.64

and the bounds on the average causal effect are computed to be

0.68 ≤ ACE(D → Y) ≤ 0.72

However, the intent-to-treat analysis gives

ACE(Z → Y) = 0.92

Here  we  see  that  even  though  compliance  is  relatively  high,  the  intent-to-treat 
analysis  can  lead  to  results  outside  the  bounds  on  the  average  causal  effect. 
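The numbers in this example can be re-derived with a short Python sketch. The eight-term lower and upper expressions used below are the ones printed in Section 6.2 as Eqs. (6.2) and (6.3); treating them as the binary-case bounds is an assumption of this sketch, made because they reproduce the values 0.68 and 0.72 quoted above. The function name `ace_bounds` and the dictionary layout `p[(y, d, z)] = P(y, d | z)` are illustrative choices.

```python
def ace_bounds(p):
    """Return (lower, upper) bounds on ACE(D -> Y); p[(y, d, z)] = P(y, d | z)."""
    def g(y, d, z):
        return p[(y, d, z)]

    lower = max(
        g(0, 0, 0) + g(1, 1, 1) - 1,
        g(0, 0, 1) + g(1, 1, 1) - 1,
        g(1, 1, 0) + g(0, 0, 1) - 1,
        g(0, 0, 0) + g(1, 1, 0) - 1,
        2 * g(0, 0, 0) + g(1, 1, 0) + g(1, 0, 1) + g(1, 1, 1) - 2,
        g(0, 0, 0) + 2 * g(1, 1, 0) + g(0, 0, 1) + g(0, 1, 1) - 2,
        g(1, 0, 0) + g(1, 1, 0) + 2 * g(0, 0, 1) + g(1, 1, 1) - 2,
        g(0, 0, 0) + g(0, 1, 0) + g(0, 0, 1) + 2 * g(1, 1, 1) - 2,
    )
    upper = min(
        1 - g(1, 0, 0) - g(0, 1, 1),
        1 - g(0, 1, 0) - g(1, 0, 1),
        1 - g(0, 1, 0) - g(1, 0, 0),
        1 - g(0, 1, 1) - g(1, 0, 1),
        2 - 2 * g(0, 1, 0) - g(1, 0, 0) - g(1, 0, 1) - g(1, 1, 1),
        2 - g(0, 1, 0) - 2 * g(1, 0, 0) - g(0, 0, 1) - g(0, 1, 1),
        2 - g(1, 0, 0) - g(1, 1, 0) - 2 * g(0, 1, 1) - g(1, 0, 1),
        2 - g(0, 0, 0) - g(0, 1, 0) - g(0, 1, 1) - 2 * g(1, 0, 1),
    )
    return lower, upper


# The example point in P space, keyed as (y, d, z).
p = {
    (0, 0, 0): 0.84, (0, 1, 0): 0.16, (1, 0, 0): 0.00, (1, 1, 0): 0.00,
    (0, 0, 1): 0.08, (0, 1, 1): 0.00, (1, 0, 1): 0.12, (1, 1, 1): 0.80,
}

# ACE(Z -> D) = P(d1 | z1) - P(d1 | z0)
compliance = (p[(0, 1, 1)] + p[(1, 1, 1)]) - (p[(0, 1, 0)] + p[(1, 1, 0)])
# Intent-to-treat: ACE(Z -> Y) = P(y1 | z1) - P(y1 | z0)
itt = (p[(1, 0, 1)] + p[(1, 1, 1)]) - (p[(1, 0, 0)] + p[(1, 1, 0)])
lower, upper = ace_bounds(p)

print(f"ACE(Z -> D) = {compliance:.2f}")
print(f"{lower:.2f} <= ACE(D -> Y) <= {upper:.2f}")
print(f"ACE(Z -> Y) = {itt:.2f}, outside bounds: {itt < lower or itt > upper}")
```

Comparing `itt` against `lower` and `upper` is the whole test: an intent-to-treat value outside that interval cannot equal the true average causal effect under the model's assumptions.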

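In general, ACE(Z → Y) will fall below the lower bound of ACE(D → Y) in the following three regions of P space:

$$\begin{aligned}
p_{01.0} + p_{10.0} &> p_{01.1} \\
p_{01.1} + p_{10.1} &> p_{10.0} \\
p_{00.1} + p_{11.0} &> p_{10.1} + p_{11.1} + p_{00.0} + p_{01.0}
\end{aligned}$$

$$\begin{aligned}
p_{01.1} &> p_{01.0} + p_{10.0} \\
p_{11.0} &> p_{10.1} + p_{11.1} + \tfrac{1}{2}\,p_{01.0}
\end{aligned}$$

$$\begin{aligned}
p_{10.0} &> p_{01.1} + p_{10.1} \\
p_{00.1} &> p_{00.0} + p_{01.0} + \tfrac{1}{2}\,p_{10.1}
\end{aligned}$$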

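In addition, ACE(Z → Y) rises above the upper bound of ACE(D → Y) in the following three regions of P space:

$$\begin{aligned}
p_{00.0} + p_{11.0} &> p_{11.1} \\
p_{00.1} + p_{11.1} &> p_{00.0} \\
p_{01.0} + p_{10.1} &> p_{10.0} + p_{11.0} + p_{00.1} + p_{01.1}
\end{aligned}$$

$$\begin{aligned}
p_{11.1} &> p_{00.0} + p_{11.0} \\
p_{01.0} &> p_{00.1} + p_{01.1} + \tfrac{1}{2}\,p_{11.0}
\end{aligned}$$

$$\begin{aligned}
p_{00.0} &> p_{00.1} + p_{11.1} \\
p_{10.1} &> p_{10.0} + p_{11.0} + \tfrac{1}{2}\,p_{00.1}
\end{aligned}$$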
In all other regions of P space consistent with the canonical partial compliance model, ACE(Z → Y) will fall within the upper and lower bounds of ACE(D → Y).

Therefore,  the  intent-to-treat  analysis  not  only  fails  to  reflect  the  uncertainty 
imposed  by  subject  noncompliance,  but  may  also  lead  to  an  estimate  of  the 
causal effect that lies outside its actual bounds. In other words, the intent-to-treat analysis cannot even state that its estimate is potentially correct given the observed distribution.


5.8  Conclusions 

This  chapter  provided  formulas  that  allow  analysts  to  make  categorical  state¬ 
ments  about  causal  effects  in  the  context  of  studies  where  subjects  are  only  par¬ 
tially  compliant.  These  formulas,  expressed  in  terms  of  the  distribution  over 
observed  variables  (treatment  assignment,  treatment  received,  and  observed  re¬ 
sponse),  represent  strict  upper  and  lower  bounds  for  the  average  causal  effect 
of  the  treatment  on  the  population.  These  bounds  are  applicable  to  all  studies 
where  the  assignment  itself  only  affects  the  observed  response  via  the  treatment 
actually  received,  regardless  of  any  interaction  that  might  take  place  between  the 
treatment  received  and  the  observed  response.  Aside  from  this  assumption,  the 
results  do  not  rest  on  any  particular  model  of  compliance  behavior. 

We  believe  that  the  results  presented  here  could  be  particularly  helpful  in 
quasi-experimental  studies,  that  is,  studies  in  which  randomized  mandated  treat¬ 
ments  are  either  unfeasible  or  undesirable  and  randomized  encouragements  are 
instituted instead ([Hol88]). For example, in evaluating the efficacy of a social
program,  the  randomized  instrument  can  be  advertisement,  incentives,  or  eligibil¬ 
ity,  letting  subjects  make  the  final  choice  of  participation.  The  bounds  established 
through  Eqs.  (5.29)  and  (5.30)  reveal  that  such  studies,  despite  the  indirectness  of 
the  randomized  instrument,  can  yield  valuable  information  on  the  average  causal 
effect  of  the  treatment  on  the  population. 

One  topic  that  should  receive  attention  in  future  work  is  the  maximum- 
likelihood  estimation  technique  for  finite  samples. 
CHAPTER 6


Continuous  treatments 

6.1  Introduction 

In  the  last  chapter,  strict  upper  and  lower  bounds  on  the  causal  effect  of  treatment 
on  response  from  partial  compliance  studies  were  derived  using  linear  optimiza¬ 
tion  techniques.  These  bounds  were  derived  for  a  model  where  the  set  of  observed 
variables  (treatment  assignment,  treatment  received,  and  observed  response)  are 
all  binary.  Aside  from  the  qualitative  structure  of  the  model,  those  results  are 
assumption  free.  [BP93,  Sections  1  and  2]  and  [Pea93a]  provide  motivation  for 
studying  the  causal  effects  identification  task  and  explain  the  basic  qualitative 
assumptions  which  are  applied  in  this  chapter  to  derive  results  applicable  when 
the  received  treatment  variable  is  not  binary. 

When  the  observed  received  treatment  is  not  binary,  it  is  difficult,  if  not 
impossible,  to  translate  the  causal  effect  bounds  evaluated  from  a  binary  model 
into a policy statement. For example, consider a quasi-experiment where subjects
are  encouraged  to  take  either  two  units  of  treatment  or  zero  units  of  treatment.  At 
the  end  of  data  collection  we  find  that  besides  zero  and  two  units,  many  subjects 
consumed  just  one  unit  of  treatment.  In  order  to  apply  the  bounds  of  Section  5.3 
the  received  treatment  must  be  transformed  to  a  binary  variable.  In  one  attempt, 
the  one  and  two  unit  treatments  are  merged  into  the  positive  received  treatment 
category  and  the  resulting  distribution  is  substituted  into  Eqs.  (5.29)  and  (5.30) 
to  compute  the  average  treatment  effect.  After  this  analysis  we  might  find  that 
the  lower  bound  on  the  treatment  causal  effect  is  positive;  therefore,  a  strict 
treatment policy is developed which states that patients who meet the study's selection criterion will be forced to consume two units of treatment.

Unfortunately,  this  transformation  of  the  three  value  domain  to  a  two  value 
domain  loses  information  about  the  distribution  of  received  treatment,  in  par¬ 
ticular,  between  one  and  two  units.  It  is  possible  that  relatively  few  subjects 
consumed  two  units  of  treatment,  and  for  those  subjects  the  treatment  had  a 
negative  causal  effect  on  response.  At  the  same  time,  the  treatment  causal  effect 
was strong for those subjects who consumed just one unit of treatment. Therefore, if a treatment policy is implemented that forces two-unit consumption, then
subjects  will  suffer  negative  consequences  on  average.  Because  of  this  short¬ 
coming  of  the  binary  treatment  analysis,  this  chapter  will  further  partition  the 
received  treatment  domain  such  that  meaningful  and  safe  results  may  be  obtained 
for  continuous  received  treatment  data. 

[AI92]  and  [EF91]  have  looked  at  the  analysis  of  causal  effects  for  studies 
where  the  domain  of  the  received  treatment  variable  is  not  binary.  In  [EF91] 
the  partial  compliance  data  is  fit  by  a  naive  treatment-response  curve  for  both 
the  placebo  and  treatment.  The  actual  treatment-response  curve  is  then  related 
to  these  two  measurable  curves  along  with  other  unmeasurable  factors.  Specific 
assumptions  allow  estimation  of  the  actual  treatment-response  curve,  but  in  gen¬ 
eral,  this  is  not  possible.  Their  framework  differs  from  the  model  presented  here, 
in  that  the  observed  response  variable  in  [EF91]  is  continuous  allowing  specifica¬ 
tion  of  a  treatment-response  curve,  while  our  use  of  a  binary  observed  response 
variable  only  allows  us  to  specify  the  probability  that  the  response  will  fall  within 
a  particular  range.  [AI92]  partition  the  continuous  treatment  variable  and  show 
that  under  a  condition  of  monotonicity  the  treatment  causal  effect  can  be  deter¬ 
mined  for  the  class  of  subjects  whose  treatment  is  influenced  by  their  treatment 
assignment.  Their  partitioning  of  the  treatment  variable  and  direct  use  of  the 
continuous  treatment  response  allows  evaluation  of  a  treatment-response  curve. 

Section  6.2  describes  the  received  treatment  partitioning  strategy  which  en¬ 
ables  derivation  of  bounds  on  the  causal  effect  of  one  treatment  level  versus  a 
base  treatment  level  when  the  domain  of  received  treatment  is  continuous.  An 
example  demonstrating  the  application  of  those  closed-form  bounds  will  then  be 
presented  in  Section  6.3.  In  Section  6.4  we  will  demonstrate  that  these  bounds 
may  be  further  tightened  when  more  than  two  homogeneous  treatment  levels 
are  extracted  from  the  continuous  domain.  Section  6.5  presents  some  concluding 
remarks. 


6.2  Derivation  of  continuous  treatment  bounds 

Suppose that the domain of the treatment variable $D$ is no longer binary, but is now continuous. We are interested in the average causal effect of one level of treatment $d_0$ (the control) versus the effect of another level $d_1$ (nominal treatment). All other treatments in $D$'s domain not in $\{d_0, d_1\}$ will be labelled by $d_m$.




Because $d_0$ and $d_1$ coincide with exact values of treatment, the independence relations discussed in Chapter 5 still hold:

$$Z \perp\!\!\!\perp Y \mid \{D = d_0, U\}$$

$$Z \perp\!\!\!\perp Y \mid \{D = d_1, U\}$$

However, $Z$ and $Y$ are no longer independent given $U$ and $D = d_m$:

$$Z \not\perp\!\!\!\perp Y \mid \{D = d_m, U\}$$

We can derive bounds on the average causal effect ACE(D → Y) by first specifying the model not in terms of a completely functional model, but in terms of a partial functional model. Because $D$ is continuous, we cannot completely specify the response function $r_y$ mapping $D$ to $Y$. Instead we specify the partial response function mapping only part of $D$'s domain ($d_0$ and $d_1$) to $Y$:

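$$y = f_y(d, r_y) = h_{y,r_y}(d)$$

where

$$h_{y,0}(d) = \begin{cases} y_0 & \text{if } d \in \{d_0, d_1\} \\ \text{undef} & \text{if } d = d_m \end{cases}
\qquad
h_{y,1}(d) = \begin{cases} y_0 & \text{if } d = d_0 \\ y_1 & \text{if } d = d_1 \\ \text{undef} & \text{if } d = d_m \end{cases}$$

$$h_{y,2}(d) = \begin{cases} y_1 & \text{if } d = d_0 \\ y_0 & \text{if } d = d_1 \\ \text{undef} & \text{if } d = d_m \end{cases}
\qquad
h_{y,3}(d) = \begin{cases} y_1 & \text{if } d \in \{d_0, d_1\} \\ \text{undef} & \text{if } d = d_m \end{cases}$$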

D  is  still  functionally  defined  by 

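$$d = f_d(z, r_d) = h_{d,r_d}(z) \tag{6.1}$$

where

$$\begin{array}{ll}
h_{d,0}(z_0) = d_0, & h_{d,0}(z_1) = d_0 \\
h_{d,1}(z_0) = d_0, & h_{d,1}(z_1) = d_m \\
h_{d,2}(z_0) = d_0, & h_{d,2}(z_1) = d_1 \\
h_{d,3}(z_0) = d_m, & h_{d,3}(z_1) = d_0 \\
h_{d,4}(z_0) = d_m, & h_{d,4}(z_1) = d_m \\
h_{d,5}(z_0) = d_m, & h_{d,5}(z_1) = d_1 \\
h_{d,6}(z_0) = d_1, & h_{d,6}(z_1) = d_0 \\
h_{d,7}(z_0) = d_1, & h_{d,7}(z_1) = d_m \\
h_{d,8}(z_0) = d_1, & h_{d,8}(z_1) = d_1
\end{array}$$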


Let $q_{jk} = P(r_d = j, r_y = k)$. Then we may write the linear relationship between the P space and the Q space as follows:

$$\begin{aligned}
P(y_0,d_0|z_0) &= q_{00} + q_{01} + q_{10} + q_{11} + q_{20} + q_{21} \\
P(y_0,d_1|z_0) &= q_{60} + q_{62} + q_{70} + q_{72} + q_{80} + q_{82} \\
P(y_1,d_0|z_0) &= q_{02} + q_{03} + q_{12} + q_{13} + q_{22} + q_{23} \\
P(y_1,d_1|z_0) &= q_{61} + q_{63} + q_{71} + q_{73} + q_{81} + q_{83} \\
P(d_m|z_0) &= q_{30} + q_{31} + q_{32} + q_{33} + q_{40} + q_{41} + q_{42} + q_{43} + q_{50} + q_{51} + q_{52} + q_{53}
\end{aligned}$$

$$\begin{aligned}
P(y_0,d_0|z_1) &= q_{00} + q_{01} + q_{30} + q_{31} + q_{60} + q_{61} \\
P(y_0,d_1|z_1) &= q_{20} + q_{22} + q_{50} + q_{52} + q_{80} + q_{82} \\
P(y_1,d_0|z_1) &= q_{02} + q_{03} + q_{32} + q_{33} + q_{62} + q_{63} \\
P(y_1,d_1|z_1) &= q_{21} + q_{23} + q_{51} + q_{53} + q_{81} + q_{83} \\
P(d_m|z_1) &= q_{10} + q_{11} + q_{12} + q_{13} + q_{40} + q_{41} + q_{42} + q_{43} + q_{70} + q_{71} + q_{72} + q_{73}
\end{aligned}$$

The reason why $P(y_0,d_m|z) + P(y_1,d_m|z)$ is treated as a single value $P(d_m|z)$ is that the individual components cannot be expressed in terms of the Q space parameters.
In terms of the Q parameter space we can write the average causal effect as

$$\mathrm{ACE}(D \to Y) = \sum_{j=0}^{8} \left[ q_{j1} - q_{j2} \right]$$

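Given this objective function and linear constraints on the Q space, we may derive general upper and lower bounds on ACE(D → Y):

$$\mathrm{ACE}(D \to Y) \ge \max \left\{ \begin{array}{l}
p_{00.0} + p_{11.1} - 1 \\
p_{00.1} + p_{11.1} - 1 \\
p_{11.0} + p_{00.1} - 1 \\
p_{00.0} + p_{11.0} - 1 \\
2p_{00.0} + p_{11.0} + p_{10.1} + p_{11.1} - 2 \\
p_{00.0} + 2p_{11.0} + p_{00.1} + p_{01.1} - 2 \\
p_{10.0} + p_{11.0} + 2p_{00.1} + p_{11.1} - 2 \\
p_{00.0} + p_{01.0} + p_{00.1} + 2p_{11.1} - 2
\end{array} \right\} \tag{6.2}$$

$$\mathrm{ACE}(D \to Y) \le \min \left\{ \begin{array}{l}
1 - p_{10.0} - p_{01.1} \\
1 - p_{01.0} - p_{10.1} \\
1 - p_{01.0} - p_{10.0} \\
1 - p_{01.1} - p_{10.1} \\
2 - 2p_{01.0} - p_{10.0} - p_{10.1} - p_{11.1} \\
2 - p_{01.0} - 2p_{10.0} - p_{00.1} - p_{01.1} \\
2 - p_{10.0} - p_{11.0} - 2p_{01.1} - p_{10.1} \\
2 - p_{00.0} - p_{01.0} - p_{01.1} - 2p_{10.1}
\end{array} \right\} \tag{6.3}$$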

It is very important to understand that no assumptions whatsoever have been made about the range of $d_m$, or the functional mapping from any values in $d_m$ to $Y$. Therefore, these bounds hold true regardless of the composition of $d_m$ (although they may turn out to be loose once more specific information about $d_m$ is obtained).
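One way to gain confidence in bounds of this kind is a forward simulation: draw random points in Q space, map them to P space with the partial functional model, and check that the true ACE always lies between the bounds computed from the observables. The Python sketch below does this for the nine compliance types and four response types defined above. It only checks validity, not tightness, and every name in it (`received`, `observed`, `ace_bounds`, `random_q`) as well as the sampling scheme is an assumption of this sketch rather than the symbolic linear-programming method used in the text.

```python
import random

LEVELS = ("d0", "dm", "d1")
Y_AT = [(0, 0), (0, 1), (1, 0), (1, 1)]  # r_y -> (y at d0, y at d1)


def received(rd, z):
    """Treatment level chosen by compliance type rd (0..8) under assignment z."""
    return LEVELS[rd // 3] if z == 0 else LEVELS[rd % 3]


def observed(q):
    """Map q[(rd, ry)] to p[(y, d, z)] = P(y, d | z) for d in {0, 1}.

    Mass landing on dm is dropped: the response there is left undefined.
    """
    p = {(y, d, z): 0.0 for y in (0, 1) for d in (0, 1) for z in (0, 1)}
    for (rd, ry), mass in q.items():
        for z in (0, 1):
            level = received(rd, z)
            if level == "dm":
                continue
            d = 0 if level == "d0" else 1
            p[(Y_AT[ry][d], d, z)] += mass
    return p


def ace_bounds(p):
    """Eight-term lower and upper bounds of Eqs. (6.2) and (6.3)."""
    g = lambda y, d, z: p[(y, d, z)]
    lower = max(
        g(0, 0, 0) + g(1, 1, 1) - 1, g(0, 0, 1) + g(1, 1, 1) - 1,
        g(1, 1, 0) + g(0, 0, 1) - 1, g(0, 0, 0) + g(1, 1, 0) - 1,
        2 * g(0, 0, 0) + g(1, 1, 0) + g(1, 0, 1) + g(1, 1, 1) - 2,
        g(0, 0, 0) + 2 * g(1, 1, 0) + g(0, 0, 1) + g(0, 1, 1) - 2,
        g(1, 0, 0) + g(1, 1, 0) + 2 * g(0, 0, 1) + g(1, 1, 1) - 2,
        g(0, 0, 0) + g(0, 1, 0) + g(0, 0, 1) + 2 * g(1, 1, 1) - 2,
    )
    upper = min(
        1 - g(1, 0, 0) - g(0, 1, 1), 1 - g(0, 1, 0) - g(1, 0, 1),
        1 - g(0, 1, 0) - g(1, 0, 0), 1 - g(0, 1, 1) - g(1, 0, 1),
        2 - 2 * g(0, 1, 0) - g(1, 0, 0) - g(1, 0, 1) - g(1, 1, 1),
        2 - g(0, 1, 0) - 2 * g(1, 0, 0) - g(0, 0, 1) - g(0, 1, 1),
        2 - g(1, 0, 0) - g(1, 1, 0) - 2 * g(0, 1, 1) - g(1, 0, 1),
        2 - g(0, 0, 0) - g(0, 1, 0) - g(0, 1, 1) - 2 * g(1, 0, 1),
    )
    return lower, upper


def random_q(rng):
    """Random point in Q space; the power makes draws sparse (near vertices)."""
    w = {(rd, ry): rng.random() ** 8 for rd in range(9) for ry in range(4)}
    total = sum(w.values())
    return {key: value / total for key, value in w.items()}


rng = random.Random(0)
violations = 0
for _ in range(2000):
    q = random_q(rng)
    # ACE(D -> Y) = sum_j (q_j1 - q_j2)
    ace = sum(m for (_, ry), m in q.items() if ry == 1) \
        - sum(m for (_, ry), m in q.items() if ry == 2)
    lo, hi = ace_bounds(observed(q))
    if not (lo - 1e-9 <= ace <= hi + 1e-9):
        violations += 1
print("violations:", violations)
```

Because the response at $d_m$ never enters `observed`, the check goes through no matter how subjects in the $d_m$ class would respond, which is the point made in the paragraph above.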

Often, in the real world, practically no subjects will consume exactly $d_0$ or $d_1$ units of treatment. Therefore, we must assume that there exist homogeneous treatment windows around $d_0$ and $d_1$. In other words, if any subject forced to consume $d_1$ ($d_0$) units of treatment has a response of $Y = y$, then had the subject consumed an amount of treatment in $[d_1 - \delta, d_1 + \delta]$ ($[d_0 - \epsilon, d_0 + \epsilon]$), the subject would have had the same response $Y = y$. This is a reasonable assumption when the window sizes ($2\delta$ and $2\epsilon$) are much smaller than the difference between $d_1$ and $d_0$. If $d_0$ is defined as zero units of treatment, then it is desirable that the window be of zero width ($\epsilon = 0$). These will be the assumptions made in our reanalysis of the Lipid study data.
6.3 Example


We will now show by example how the bounds of Eqs. (6.2) and (6.3) can be used to provide meaningful information about causal effects. Reconsider the Lipid Research Clinics Coronary Primary Prevention Trial data described in Section 5.4.


In order to apply our analysis to this study, the continuous data obtained in the [Pro84] study must be transformed to the discrete variables representing treatment assignment ($Z$), received treatment ($D$), and observed response ($Y$). The following transformation accomplishes this by thresholding dosage consumption and change in cholesterol level:

$$d = \begin{cases} d_0 & \text{if } z = z_0 \text{ or } b = 0 \\ d_1 & \text{if } z = z_1 \text{ and } \gamma - \rho \le b \le \gamma + \rho \\ d_m & \text{otherwise} \end{cases} \tag{6.4}$$

$$y = \begin{cases} y_0 & \text{if } c_I - c_F < \delta \\ y_1 & \text{if } c_I - c_F \ge \delta \end{cases} \tag{6.5}$$

$\gamma$ and $\rho$ are the center and radius of the window of positive treatment, while $\delta$ specifies the minimum decrease in cholesterol level which we consider a positive treatment response. This discretization assumes that subjects taking between $\gamma - \rho$ and $\gamma + \rho$ units of cholestyramine form a homogeneous treatment-response group. In addition, Eq. (6.4) reflects the finding that subjects assigned placebo ($z_0$) did not take cholestyramine, namely,

$$P(d_1|z_0) = 0 \qquad P(d_m|z_0) = 0$$

Clearly, by varying the threshold $\delta$ over the range of $Y$ one obtains upper and lower bounds on the entire distribution of the treatment effect, $P(Y^* < y \mid \hat{d}_1^*) - P(Y^* < y \mid \hat{d}_0^*)$.

For the current analysis we set $\rho = 7$ and $\gamma = 94$, while the threshold for cholesterol level reduction in Eq. (6.5) was selected at $\delta = 38$ units. If the data samples are interpreted according to (6.4) and (6.5), then the conditional distribution over $(Z, D, Y)$ results in the distribution given in Table 6.1.^1
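The discretization of Eqs. (6.4) and (6.5) is simple enough to state as code. The Python sketch below is illustrative only: the function name `discretize` and its argument names are hypothetical, $c_I$ and $c_F$ are read here as the initial and final cholesterol levels, and the window is treated as closed at both ends; none of these details is confirmed beyond what the equations themselves show.

```python
def discretize(z, b, c_initial, c_final, gamma=94.0, rho=7.0, delta=38.0):
    """Map one subject's raw record to the discrete pair (d, y).

    z          -- treatment assignment, 0 (placebo) or 1 (cholestyramine)
    b          -- dosage actually consumed
    c_initial  -- cholesterol level before treatment (assumed meaning of c_I)
    c_final    -- cholesterol level after treatment (assumed meaning of c_F)
    gamma, rho -- center and radius of the positive treatment window
    delta      -- minimum cholesterol decrease counted as a positive response
    """
    # Eq. (6.4): received treatment level.
    if z == 0 or b == 0:
        d = "d0"
    elif gamma - rho <= b <= gamma + rho:
        d = "d1"
    else:
        d = "dm"

    # Eq. (6.5): binary response from the change in cholesterol level.
    y = 1 if (c_initial - c_final) >= delta else 0
    return d, y


# Hypothetical records, for illustration only (not taken from the study data).
print(discretize(z=1, b=95, c_initial=290, c_final=250))
print(discretize(z=1, b=50, c_initial=290, c_final=280))
print(discretize(z=0, b=0, c_initial=280, c_final=240))
```

Tabulating the relative frequencies of the resulting `(d, y)` pairs separately for each `z` would give an estimate of $P(y, d \mid z)$ under the large-sample assumption.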

By computing the quantities required for (6.2), we obtain

$$\mathrm{ACE}(D \to Y) \ge \max \left\{ \begin{array}{l} 0.262,\ -0.685,\ -0.976,\ -0.029, \\ 0.233,\ -0.902,\ -1.632,\ -0.423 \end{array} \right\} = 0.262$$

^1 We make the large-sample assumption and take the sample frequencies as representing $P(y, d \mid z)$.
P(y,d|z)          z0                  z1
              y0       y1         y0       y1
d0          0.971    0.029      0.024    0.000
dm          0.000    0.000      0.436    0.146
d1          0.000    0.000      0.103    0.291

Table 6.1: Conditional probability distribution P(y,d|z) for the Lipid Research Clinic Program (1984) data, discretized by Eqs. (6.4) and (6.5).


Those needed for (6.3) give us

$$\mathrm{ACE}(D \to Y) \le \min \left\{ \begin{array}{l} 0.868,\ 1.000,\ 0.971,\ 0.897, \\ 1.680,\ 1.815,\ 1.765,\ 0.926 \end{array} \right\} = 0.868$$

Accordingly, we conclude that the treatment causal effect lies in the range

$$0.262 \le \mathrm{ACE}(D \to Y) \le 0.868$$

which is quite informative; the experimenter can categorically state that when applied uniformly to the population, a dosage of 87 to 101 units of cholestyramine is guaranteed to improve by at least 26.2% the probability of reducing a patient's level of cholesterol by 38 points or more. This guarantee is established despite the fact that 60.6% of the subjects in the treatment group did not comply with their assigned dosage level. For comparison, note that the intent-to-treat analysis in this study gives $P(y_1|z_1) - P(y_1|z_0) = 0.408$, meaning that, relative to this figure, enforcing full compliance might result in as much as 46% improvement and no more than 14.6% reduction in the proportion of patients benefiting from the treatment.
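The figures quoted in this example can be recomputed directly from Table 6.1. The following Python sketch lists the eight terms of Eq. (6.2) and the eight terms of Eq. (6.3), and then derives the intent-to-treat value and the noncompliance rate. The variable names and the `(y, d, z)` keying of the table are illustrative assumptions; the probabilities themselves are copied from Table 6.1.

```python
# Table 6.1, keyed as (y, d, z) with d in {"0", "m", "1"}; values are P(y, d | z).
table = {
    (0, "0", 0): 0.971, (1, "0", 0): 0.029,
    (0, "m", 0): 0.000, (1, "m", 0): 0.000,
    (0, "1", 0): 0.000, (1, "1", 0): 0.000,
    (0, "0", 1): 0.024, (1, "0", 1): 0.000,
    (0, "m", 1): 0.436, (1, "m", 1): 0.146,
    (0, "1", 1): 0.103, (1, "1", 1): 0.291,
}


def g(y, d, z):
    """P(y, d | z); d may be given as 0, 1 or "m"."""
    return table[(y, str(d), z)]


# Terms of Eq. (6.2), in the order printed in the text.
lower_terms = [
    g(0, 0, 0) + g(1, 1, 1) - 1,
    g(0, 0, 1) + g(1, 1, 1) - 1,
    g(1, 1, 0) + g(0, 0, 1) - 1,
    g(0, 0, 0) + g(1, 1, 0) - 1,
    2 * g(0, 0, 0) + g(1, 1, 0) + g(1, 0, 1) + g(1, 1, 1) - 2,
    g(0, 0, 0) + 2 * g(1, 1, 0) + g(0, 0, 1) + g(0, 1, 1) - 2,
    g(1, 0, 0) + g(1, 1, 0) + 2 * g(0, 0, 1) + g(1, 1, 1) - 2,
    g(0, 0, 0) + g(0, 1, 0) + g(0, 0, 1) + 2 * g(1, 1, 1) - 2,
]

# Terms of Eq. (6.3), in the order printed in the text.
upper_terms = [
    1 - g(1, 0, 0) - g(0, 1, 1),
    1 - g(0, 1, 0) - g(1, 0, 1),
    1 - g(0, 1, 0) - g(1, 0, 0),
    1 - g(0, 1, 1) - g(1, 0, 1),
    2 - 2 * g(0, 1, 0) - g(1, 0, 0) - g(1, 0, 1) - g(1, 1, 1),
    2 - g(0, 1, 0) - 2 * g(1, 0, 0) - g(0, 0, 1) - g(0, 1, 1),
    2 - g(1, 0, 0) - g(1, 1, 0) - 2 * g(0, 1, 1) - g(1, 0, 1),
    2 - g(0, 0, 0) - g(0, 1, 0) - g(0, 1, 1) - 2 * g(1, 0, 1),
]

lower, upper = max(lower_terms), min(upper_terms)

# Intent-to-treat value P(y1 | z1) - P(y1 | z0), summing over all dosage classes.
itt = sum(g(1, d, 1) for d in "0m1") - sum(g(1, d, 0) for d in "0m1")

# Share of the treatment arm outside the positive treatment window.
noncompliance = 1 - (g(0, 1, 1) + g(1, 1, 1))

print([round(t, 3) for t in lower_terms], "-> lower =", round(lower, 3))
print([round(t, 3) for t in upper_terms], "-> upper =", round(upper, 3))
print("ITT =", round(itt, 3), " noncompliance =", round(noncompliance, 3))
print("upper - ITT =", round(upper - itt, 3), " ITT - lower =", round(itt - lower, 3))
```

The last line shows how far the true effect could lie above or below the intent-to-treat value while remaining consistent with the observed distribution.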

In the above analysis, we selected $\rho$ and $\gamma$ such that the subjects who consumed the greatest quantity of cholestyramine would be classified as having received positive treatment. This is not necessary, though; beyond a certain dosage, a treatment may actually impede the mechanism whereby positive response is attained. It is possible that a higher feasible range of causal effects may be attainable by examining a different range of consumed treatment than the maximum range. Hence, we can re-analyze the cholesterol treatment by evaluating the feasible range of ACE(D → Y) for different values of $\gamma$ while keeping $\rho$ and $\delta$ fixed. Figure 6.1 presents the results of this analysis, and shows that the highest lower bound on the treatment causal effect is obtained when we use the maximum received treatment. In a sense, this graph can be viewed as a treatment-response curve, where the differences in the probability of reducing the cholesterol level by at least $\delta$ units under treatment and placebo are plotted against the received treatment ($\gamma$), rather than a plot of the difference in cholesterol against received treatment.

It is interesting to see how the bounds on ACE(D → Y) in this study depend on the threshold ($\delta$) used to transform the continuous observed response to the binary observed response. The results in Figure 6.1 indicate that the maximum received treatment in the cholestyramine study gives the highest lower bound on the treatment causal effect (which is preferred in this case); therefore, we plot the treatment causal effect of the maximum cholestyramine dosage as a function of $\delta$. These results are rendered in Figure 6.2.

6.4  Further  decomposition  of  treatment 

It is possible that the bounds calculated from Eqs. (6.2) and (6.3) may become tighter as the remaining treatment class $d_m$ is further decomposed and incorporated into the analysis. For example, suppose that the treatment variable's domain may be partitioned into three values such that the following independence assumptions are sufficiently accurate:

$$Z \perp\!\!\!\perp Y \mid \{D = d_0, U\}$$

$$Z \perp\!\!\!\perp Y \mid \{D = d_1, U\}$$

$$Z \perp\!\!\!\perp Y \mid \{D = d_m, U\}$$

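In this case, the domain of the treatment-response variable, $r_y$, may be partitioned into eight values, where $Y$ is functionally determined by $D$ and $r_y$:

$$y = f_y(d, r_y) = h_{y,r_y}(d) \tag{6.6}$$

where

$$h_{y,0}(d) = y_0
\qquad
h_{y,1}(d) = \begin{cases} y_0 & \text{if } d \in \{d_0, d_m\} \\ y_1 & \text{if } d = d_1 \end{cases}$$

$$h_{y,2}(d) = \begin{cases} y_0 & \text{if } d = d_0 \\ y_1 & \text{if } d = d_m \\ y_0 & \text{if } d = d_1 \end{cases}
\qquad
h_{y,3}(d) = \begin{cases} y_0 & \text{if } d = d_0 \\ y_1 & \text{if } d \in \{d_m, d_1\} \end{cases}$$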
[Figure 6.1 (plot not reproduced): ACE(D → Y) on the vertical axis against the positive treatment window center on the horizontal axis, 0 to 100.]

Figure 6.1: Ranges of ACE(D → Y) evaluated for the cholestyramine treatment data for different positive treatment window centers ($\gamma$). For all values of $\gamma$, the radius of the positive treatment window ($\rho$) is 7 and the positive observed response threshold ($\delta$) is 38.
[Figure 6.2 (plot not reproduced): ACE(D → Y) on the vertical axis, -0.20 to 1.00, against the positive treatment response threshold on the horizontal axis, 0 to 100.]

Figure 6.2: Ranges of ACE(D → Y) evaluated for the cholestyramine treatment data for different positive observed response thresholds ($\delta$). For all values of $\delta$, the radius of the positive treatment window ($\rho$) is 7 and the positive treatment window center ($\gamma$) is 94.
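$$h_{y,4}(d) = \begin{cases} y_1 & \text{if } d = d_0 \\ y_0 & \text{if } d \in \{d_m, d_1\} \end{cases}
\qquad
h_{y,5}(d) = \begin{cases} y_1 & \text{if } d = d_0 \\ y_0 & \text{if } d = d_m \\ y_1 & \text{if } d = d_1 \end{cases}$$

$$h_{y,6}(d) = \begin{cases} y_1 & \text{if } d \in \{d_0, d_m\} \\ y_0 & \text{if } d = d_1 \end{cases}
\qquad
h_{y,7}(d) = y_1$$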


The causal effect of the treatment can now be obtained directly from Eqs. (6.1) and (6.6), giving

$$P(y_1^* \mid \hat{d}_1^*) = P(r_y{=}1) + P(r_y{=}3) + P(r_y{=}5) + P(r_y{=}7)$$

$$P(y_1^* \mid \hat{d}_0^*) = P(r_y{=}4) + P(r_y{=}5) + P(r_y{=}6) + P(r_y{=}7)$$

and

$$\mathrm{ACE}(D \to Y) = P(r_y{=}1) + P(r_y{=}3) - P(r_y{=}4) - P(r_y{=}6) \tag{6.7}$$

The distribution over potential responses, $P(r_d, r_y)$, is specified by 72 parameters. Let these parameters be notated as follows:

$$q_{jk} = P(r_d = j, r_y = k)$$

The probabilistic constraint

$$\sum_{j=0}^{8} \sum_{k=0}^{7} q_{jk} = 1$$

implies that there are only 71 independent parameters.

In terms of the Q parameter space we can rewrite Eq. (6.7) as

$$\mathrm{ACE}(D \to Y) = \sum_{j=0}^{8} \left[ q_{j1} + q_{j3} - q_{j4} - q_{j6} \right]$$

The conditional distribution $P(y,d|z)$ over the observable variables is fully specified by 12 parameters, which will be notated as follows:

$$\begin{array}{ll}
p_{00.0} = P(y_0,d_0|z_0) & p_{00.1} = P(y_0,d_0|z_1) \\
p_{0m.0} = P(y_0,d_m|z_0) & p_{0m.1} = P(y_0,d_m|z_1) \\
p_{01.0} = P(y_0,d_1|z_0) & p_{01.1} = P(y_0,d_1|z_1) \\
p_{10.0} = P(y_1,d_0|z_0) & p_{10.1} = P(y_1,d_0|z_1) \\
p_{1m.0} = P(y_1,d_m|z_0) & p_{1m.1} = P(y_1,d_m|z_1) \\
p_{11.0} = P(y_1,d_1|z_0) & p_{11.1} = P(y_1,d_1|z_1)
\end{array}$$

The probabilistic constraints

$$\sum_{i \in \{0,1\}} \sum_{j \in \{0,m,1\}} p_{ij.0} = 1 \qquad \sum_{i \in \{0,1\}} \sum_{j \in \{0,m,1\}} p_{ij.1} = 1$$

imply that there are only 10 independent parameters.

Given some point $q$ in Q space, there is a direct linear transformation to the corresponding point $p$ in the observation space P:

$$\begin{aligned}
p_{00.0} &= q_{00} + q_{01} + q_{02} + q_{03} + q_{10} + q_{11} + q_{12} + q_{13} + q_{20} + q_{21} + q_{22} + q_{23} \\
p_{0m.0} &= q_{30} + q_{31} + q_{34} + q_{35} + q_{40} + q_{41} + q_{44} + q_{45} + q_{50} + q_{51} + q_{54} + q_{55} \\
p_{01.0} &= q_{60} + q_{62} + q_{64} + q_{66} + q_{70} + q_{72} + q_{74} + q_{76} + q_{80} + q_{82} + q_{84} + q_{86} \\
p_{10.0} &= q_{04} + q_{05} + q_{06} + q_{07} + q_{14} + q_{15} + q_{16} + q_{17} + q_{24} + q_{25} + q_{26} + q_{27} \\
p_{1m.0} &= q_{32} + q_{33} + q_{36} + q_{37} + q_{42} + q_{43} + q_{46} + q_{47} + q_{52} + q_{53} + q_{56} + q_{57} \\
p_{11.0} &= q_{61} + q_{63} + q_{65} + q_{67} + q_{71} + q_{73} + q_{75} + q_{77} + q_{81} + q_{83} + q_{85} + q_{87} \\
p_{00.1} &= q_{00} + q_{01} + q_{02} + q_{03} + q_{30} + q_{31} + q_{32} + q_{33} + q_{60} + q_{61} + q_{62} + q_{63} \\
p_{0m.1} &= q_{10} + q_{11} + q_{14} + q_{15} + q_{40} + q_{41} + q_{44} + q_{45} + q_{70} + q_{71} + q_{74} + q_{75} \\
p_{01.1} &= q_{20} + q_{22} + q_{24} + q_{26} + q_{50} + q_{52} + q_{54} + q_{56} + q_{80} + q_{82} + q_{84} + q_{86} \\
p_{10.1} &= q_{04} + q_{05} + q_{06} + q_{07} + q_{34} + q_{35} + q_{36} + q_{37} + q_{64} + q_{65} + q_{66} + q_{67} \\
p_{1m.1} &= q_{12} + q_{13} + q_{16} + q_{17} + q_{42} + q_{43} + q_{46} + q_{47} + q_{72} + q_{73} + q_{76} + q_{77} \\
p_{11.1} &= q_{21} + q_{23} + q_{25} + q_{27} + q_{51} + q_{53} + q_{55} + q_{57} + q_{81} + q_{83} + q_{85} + q_{87}
\end{aligned}$$

which will be written in matrix form, $p = Pq$. This relationship between points in Q and P space implies the following constraints on points in P space:

$$\begin{aligned}
p_{00.0} + p_{10.1} &\le 1 \\
p_{0m.0} + p_{1m.1} &\le 1 \\
p_{01.0} + p_{11.1} &\le 1 \\
p_{10.0} + p_{00.1} &\le 1 \\
p_{1m.0} + p_{0m.1} &\le 1 \\
p_{11.0} + p_{01.1} &\le 1
\end{aligned}$$

Similar to the P space constraints in the binary case (Appendix A.1), we can prove that these constraints are necessary and sufficient for a point in P space to be modelled by some point in Q space.
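The transformation $p = Pq$ need not be written out by hand: it follows mechanically from Eqs. (6.1) and (6.6). The Python sketch below regenerates, for any observable $P(y,d|z)$, the list of $q_{jk}$ terms that contribute to it, and confirms why each of the six P space constraints holds (the two probabilities in each constraint draw on disjoint sets of $q_{jk}$). The encodings used here, with $r_d$ read as a two-digit base-3 number and $r_y$ as a three-bit number, are assumptions inferred from the equations above, and the helper names are hypothetical.

```python
from itertools import product

LEVELS = ("0", "m", "1")  # d0, dm, d1


def received(rd, z):
    """Treatment level taken by compliance type rd (0..8) under assignment z."""
    return LEVELS[rd // 3] if z == 0 else LEVELS[rd % 3]


def response(ry, d):
    """Response of type ry (0..7): bit 2 is y at d0, bit 1 at dm, bit 0 at d1."""
    shift = {"0": 2, "m": 1, "1": 0}[d]
    return (ry >> shift) & 1


def support(y, d, z):
    """Index pairs (j, k) of the q_jk terms that sum to P(y, d | z)."""
    return [(rd, ry) for rd, ry in product(range(9), range(8))
            if received(rd, z) == d and response(ry, d) == y]


def as_sum(y, d, z):
    """Human-readable expansion, e.g. 'q10 + q11 + ...'."""
    return " + ".join(f"q{j}{k}" for j, k in support(y, d, z))


# Reproduce one row of the transformation: p_{0m.1} = P(y0, dm | z1).
print("p0m.1 =", as_sum(0, "m", 1))

# Coefficient of q_jk in ACE(D -> Y): response at d1 minus response at d0.
ace_coeff = {ry: response(ry, "1") - response(ry, "0") for ry in range(8)}
print("ACE coefficients by r_y:", ace_coeff)

# Each P space constraint pairs P(y, d | z0) with P(1 - y, d | z1); the two
# supports never overlap, so the two probabilities can sum to at most 1.
disjoint = all(
    not (set(support(y, d, 0)) & set(support(1 - y, d, 1)))
    for y in (0, 1) for d in LEVELS
)
print("constraint pairs disjoint:", disjoint)
```

The nonzero entries of `ace_coeff` are +1 for $r_y \in \{1, 3\}$ and -1 for $r_y \in \{4, 6\}$, which agrees with Eq. (6.7).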

ACE(D → Y) may be optimized given the constraints

$$p = Pq$$

$$\sum_{j=0}^{8} \sum_{k=0}^{7} q_{jk} = 1$$

$$q_{jk} \ge 0 \quad \text{for } j \in \{0,\ldots,8\} \text{ and } k \in \{0,\ldots,7\}$$

using a program written for obtaining symbolic solutions to linear-programming problems. The following lower and upper bounds for ACE(D → Y) were obtained:

$$L_{D \to Y}(p) = \max \left\{ \begin{array}{l}
p_{00.0} + p_{11.1} - 1 \\
p_{00.1} + p_{11.1} - 1 \\
p_{11.0} + p_{00.1} - 1 \\
p_{00.0} + p_{11.0} - 1 \\
2p_{00.0} + p_{11.0} + p_{10.1} + p_{11.1} - 2 \\
p_{00.0} + 2p_{11.0} + p_{00.1} + p_{01.1} - 2 \\
p_{10.0} + p_{11.0} + 2p_{00.1} + p_{11.1} - 2 \\
p_{00.0} + p_{01.0} + p_{00.1} + 2p_{11.1} - 2 \\
-p_{0m.0} - p_{01.0} - p_{10.0} - p_{01.1} - p_{10.1} - p_{1m.1} \\
-p_{01.0} - p_{10.0} - p_{1m.0} - p_{0m.1} - p_{01.1} - p_{10.1}
\end{array} \right\}$$

$$U_{D \to Y}(p) = \min \left\{ \begin{array}{l}
1 - p_{10.0} - p_{01.1} \\
1 - p_{01.0} - p_{10.1} \\
1 - p_{01.0} - p_{10.0} \\
1 - p_{01.1} - p_{10.1} \\
2 - 2p_{01.0} - p_{10.0} - p_{10.1} - p_{11.1} \\
2 - p_{01.0} - 2p_{10.0} - p_{00.1} - p_{01.1} \\
2 - p_{10.0} - p_{11.0} - 2p_{01.1} - p_{10.1} \\
2 - p_{00.0} - p_{01.0} - p_{01.1} - 2p_{10.1} \\
p_{00.0} + p_{0m.0} + p_{11.0} + p_{00.1} + p_{1m.1} + p_{11.1} \\
p_{00.0} + p_{1m.0} + p_{11.0} + p_{00.1} + p_{0m.1} + p_{11.1}
\end{array} \right\}$$

6.5  Conclusion 

In this chapter, strict bounds on the causal effect of treatment on observed response have been derived for models where received treatment takes on non-binary values and treatment assignment and observed response are binary. These bounds can be used to derive useful bounds for treatment causal effects on quasi-experimental data containing continuous values of received treatment. By useful, we mean that a policy statement for treatment may be specified.

In  future  work,  we  might  explore  how  exact  knowledge  of  one  point  in  the 
treatment-response  curve  can  constrain  the  bounds  of  the  rest  of  the  treatment- 
response  curve;  for  example,  there  might  exist  experimental  data  that  has  pre¬ 
cisely  determined  the  causal  effect  of  a  specific  drug  dosage  on  the  probability  of 
recovering  from  a  particular  ailment. 
CHAPTER 7


Statistics  in  Law 


7.1  Introduction 

Over the last thirty years, there has been a steady increase in the number of court cases where statistical evidence has been introduced. These include cases involving product liability [BBB92], employment discrimination [MSZ84] [Fin80], price-fixing (antitrust litigation) [DF85], genetic evidence, etc. [Kay90]. Most of the statistical evidence provided is in the form of regression analysis [Fis80] [Fin80] [DF85], or by comparing the relative rates of some event across different populations, e.g., hiring rates among different races/sexes, or cases of illness among employees and non-employees. Most analysis of the data goes into showing the accuracy of the results given the sample size, but except for regression analysis, qualitative information is introduced after the statistical data has been presented.

The comparison of relative rates among different populations may produce serious errors in judgment, as will be shown by a hypothetical example in Section 7.2. These rates demonstrate dependence, but do not necessarily prove that those rates are a result of a defendant's actions, because other unobserved factors may be responsible for the dependency. It turns out that the participants of the court case ask the right question (in terms of a counterfactual), e.g., "If the plaintiff were male, would she have been promoted a year earlier?"; however, although the right question is usually posed, the analysis usually fails to reflect the structure of the problem.

Ideally,  a  court  would  settle  upon  a  qualitative  model  describing  the  causal 
relationships  between  variables  in  the  system.  Then  the  statistical  data  would 
be  applied  to  parameterize  the  causal  structure.  Given  this  model  of  the  system, 
the  counterfactual  conditional  may  then  be  evaluated  and  incorporated  into  the 
judgment. 

When applying regression analysis, the variable claimed to be the basis of discrimination or unfair control is assumed to be exogenous to the system (i.e., a root node in the causal structure). Chapter 8 shows that not all variables that may be controlled satisfy this assumption, e.g., in a price-fixing case, the controlled variable is a product’s price that is ordinarily influenced by other market factors.

It is very important that the model proposed by counsel be compatible with the statistical model provided as evidence. A case of incompatibility is discussed in [GKR94], where it was hypothesized that college attendance rates accounted for the difference in pass rates between men and women on a test for promotion, yet the two distributions (one relating gender and college attendance, the other relating gender and rate of promotion) were inconsistent with a model where college attendance is supposed to explain away any gender bias. In a similar vein, Chapter 5 presented constraints (Eq. 5.13) on the observed distribution imposed by the assumed model of interaction in experimental studies with partial compliance. If a distribution fails these constraints, then the assumed model is improper for evaluating average treatment effects.


7.2  Hypothetical  Product  Safety  Litigation 

Evaluation of counterfactual probabilities could be enlightening in some legal cases in which a plaintiff claims that a defendant’s actions were responsible for the plaintiff’s misfortune. Improper rulings can easily be issued without an adequate treatment of counterfactuals. Consider the following hypothetical and fictitious case study, especially crafted to accentuate the disparity between different methods of analysis.

The marketer of PeptAid (antacid medication) randomly mailed out product samples to 10% of the households in the city of Stress, California. In a follow-up study, researchers determined for each individual whether they received the PeptAid sample, whether they consumed PeptAid, and whether they developed peptic ulcers in the following month.

The causal structure which describes the influences in this scenario is identical to the partial-compliance model given by Figure 5.1, where \(z_1\) asserts that PeptAid was received from the marketer; \(d_1\) asserts that PeptAid was consumed; and \(y_1\) asserts that peptic ulceration occurred. The data showed the following distribution:


\[
P(z_1) = 0.1
\]

\[
\begin{array}{ll}
P(y_0, d_0 \mid z_0) = 0.32 & \qquad P(y_0, d_0 \mid z_1) = 0.02 \\
P(y_0, d_1 \mid z_0) = 0.32 & \qquad P(y_0, d_1 \mid z_1) = 0.17 \\
P(y_1, d_0 \mid z_0) = 0.04 & \qquad P(y_1, d_0 \mid z_1) = 0.67 \\
P(y_1, d_1 \mid z_0) = 0.32 & \qquad P(y_1, d_1 \mid z_1) = 0.14
\end{array}
\]

This data indicates a high correlation between those individuals who consumed PeptAid and those who developed peptic ulcers in the following month:

\[
P(y_1 \mid d_1) = 0.50 \qquad P(y_1 \mid d_0) = 0.26
\]

In addition, the intent-to-treat analysis showed that those individuals who received the PeptAid samples had a 45 percentage-point greater chance of developing peptic ulcers:

\[
P(y_1 \mid z_1) = 0.81 \qquad P(y_1 \mid z_0) = 0.36
\]
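
The four association figures above follow directly from the observed distribution. The sketch below recomputes them; the dictionary layout and the function names are illustrative choices, not part of the original study, and only the probabilities are taken from the text.

```python
# Observed distribution from the text: P(z1) and P(y, d | z).
P_Z1 = 0.1
P_YD_GIVEN_Z = {
    0: {(0, 0): 0.32, (0, 1): 0.32, (1, 0): 0.04, (1, 1): 0.32},  # z0: no sample
    1: {(0, 0): 0.02, (0, 1): 0.17, (1, 0): 0.67, (1, 1): 0.14},  # z1: sample received
}


def joint(y, d, z):
    """P(y, d, z) = P(z) * P(y, d | z)."""
    p_z = P_Z1 if z == 1 else 1.0 - P_Z1
    return p_z * P_YD_GIVEN_Z[z][(y, d)]


def p_ulcer_given_consumption(d):
    """P(y1 | d), marginalizing over whether a sample was received."""
    numerator = sum(joint(1, d, z) for z in (0, 1))
    denominator = sum(joint(y, d, z) for y in (0, 1) for z in (0, 1))
    return numerator / denominator


def p_ulcer_given_sample(z):
    """P(y1 | z), the intent-to-treat quantity."""
    return sum(P_YD_GIVEN_Z[z][(1, d)] for d in (0, 1))


if __name__ == "__main__":
    print(round(p_ulcer_given_consumption(1), 2), round(p_ulcer_given_consumption(0), 2))
    print(round(p_ulcer_given_sample(1), 2), round(p_ulcer_given_sample(0), 2))
```

Both comparisons are purely associational; neither removes the influence of unobserved factors such as pre-ulcer discomfort.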


The  plaintiff  (Mr.  Smith),  having  heard  of  the  study,  litigated  against  both 
the  marketing  firm  and  the  PeptAid  producer.  The  plaintiff’s  attorney  argued 
against  the  producer,  claiming  that  the  consumption  of  PeptAid  triggered  his 
client’s  ulcer  and  resulting  medical  expenses.  Likewise,  the  plaintiff’s  attorney 
argued  against  the  marketer,  claiming  that  his  client  would  not  have  developed 
an  ulcer,  if  the  marketer  had  not  distributed  the  product  samples. 

The defense attorney, representing both the manufacturer and marketer of PeptAid, though, rebutted this argument, stating that the high correlation between PeptAid consumption and ulcers was attributable to a common factor, namely, pre-ulcer discomfort. Individuals with gastrointestinal discomfort would be much more likely to both use PeptAid and develop stomach ulcers. To bolster his clients’ claims, the defense attorney introduced expert analysis of the data showing that, on the average, consumption of PeptAid actually decreases an individual’s chances of developing ulcers by at least 15%.

Indeed, the application of Eqs. 5.29 and 5.30 results in the following bounds on the average causal effect of PeptAid consumption on peptic ulceration

\[
-0.23 \le \mathrm{ACE}(D \rightarrow Y) \le -0.15
\]

and proves that PeptAid is beneficial to the population as a whole.
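
The bounds can be reproduced numerically. The sketch below uses the closed-form instrumental-variable bounds on the average causal effect as they are commonly stated in the literature; that this list of terms coincides with Eqs. 5.29 and 5.30 is an assumption, since those equations are not reproduced in this chapter. The key `(y, d, z)` stands for \(P(y, d \mid z)\), and the function name is illustrative.

```python
# P[(y, d, z)] = P(y, d | z), copied from the observed distribution in the text.
P = {
    (0, 0, 0): 0.32, (0, 1, 0): 0.32, (1, 0, 0): 0.04, (1, 1, 0): 0.32,
    (0, 0, 1): 0.02, (0, 1, 1): 0.17, (1, 0, 1): 0.67, (1, 1, 1): 0.14,
}


def ace_bounds(p):
    """Lower and upper bounds on ACE(D -> Y) under partial compliance.

    Assumption: the eight-term max/min expressions below are the standard
    closed-form bounds; they are taken to match Eqs. 5.29 and 5.30.
    """
    p00_0, p01_0, p10_0, p11_0 = p[0, 0, 0], p[0, 1, 0], p[1, 0, 0], p[1, 1, 0]
    p00_1, p01_1, p10_1, p11_1 = p[0, 0, 1], p[0, 1, 1], p[1, 0, 1], p[1, 1, 1]
    lower = max(
        p11_1 + p00_0 - 1,
        p11_0 + p00_1 - 1,
        p11_0 - p11_1 - p10_1 - p01_0 - p10_0,
        p11_1 - p11_0 - p10_0 - p01_1 - p10_1,
        -p01_1 - p10_1,
        -p01_0 - p10_0,
        p00_1 - p01_1 - p10_1 - p01_0 - p00_0,
        p00_0 - p01_0 - p10_0 - p01_1 - p00_1,
    )
    upper = min(
        1 - p01_1 - p10_0,
        1 - p01_0 - p10_1,
        -p01_0 + p01_1 + p00_1 + p11_0 + p00_0,
        -p01_1 + p11_1 + p00_1 + p01_0 + p00_0,
        p11_1 + p00_1,
        p11_0 + p00_0,
        -p10_1 + p11_1 + p00_1 + p11_0 + p10_0,
        -p10_0 + p11_0 + p00_0 + p11_1 + p10_1,
    )
    return lower, upper


if __name__ == "__main__":
    low, high = ace_bounds(P)
    print(round(low, 2), round(high, 2))
```

With the PeptAid data the two active terms are the last entry of the lower list and the seventh entry of the upper list, which give the interval quoted above.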

The plaintiff’s attorney, though, stressed the distinction between the average treatment effects for the entire population and the sub-population consisting of those individuals who, like his client, received the PeptAid sample, consumed it and then developed ulcers. Analysis of the population data indicated that had PeptAid not been distributed, Mr. Smith would have had at most a 7% chance of developing ulcers regardless of any confounding factors such as pre-ulcer pain. Likewise, if Mr. Smith had not consumed PeptAid, he would have had at most a 7% chance of developing ulcers.

The damaging statistics against the marketer are obtained by evaluating the bounds on the probability that the plaintiff would have developed a peptic ulcer if he had not received the PeptAid sample, given that he in fact received the PeptAid sample, consumed the PeptAid, and developed peptic ulcers. This probability may be written in terms of the functional model parameters:

\[
P(y_1^* \mid z_0^*, y_1, d_1, z_1) = \frac{P(r_z = 1)\,[q_{13} + q_{31} + q_{33}]}{P(y_1, d_1, z_1)}
\]

But, since Z is a root node in the probabilistic specification, \(P(r_z = 1) = P(z_1)\); therefore,

\[
P(y_1^* \mid z_0^*, y_1, d_1, z_1) = \frac{q_{13} + q_{31} + q_{33}}{P(y_1, d_1 \mid z_1)} = \frac{q_{13} + q_{31} + q_{33}}{p_{11.1}}
\]


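This expression is linear with respect to the Q parameters; therefore, we may use linear optimization to derive symbolic bounds on the counterfactual probability with respect to the probabilistic specification \(P(y, d \mid z)\):

\[
\frac{1}{p_{11.1}} \max \left\{
\begin{array}{c}
0 \\
p_{11.1} - p_{00.0} \\
p_{11.0} - p_{00.1} - p_{10.1} \\
p_{10.0} - p_{01.1} - p_{10.1}
\end{array}
\right\}
\le P(y_1^* \mid z_0^*, z_1, d_1, y_1) \le
\frac{1}{p_{11.1}} \min \left\{
\begin{array}{c}
p_{11.1} \\
p_{10.0} + p_{11.0} \\
1 - p_{00.0} - p_{10.1}
\end{array}
\right\}
\]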

Similarly, the damaging evidence against PeptAid’s producer is obtained by evaluating the bounds on the counterfactual probability \(P(y_1^* \mid d_0^*, y_1, d_1, z_1)\). In terms of the Q parameters the counterfactual probability is written:

\[
P(y_1^* \mid d_0^*, y_1, d_1, z_1) = \frac{q_{13} + q_{33}}{q_{11} + q_{13} + q_{31} + q_{33}} = \frac{q_{13} + q_{33}}{p_{11.1}}
\]

If we minimize/maximize the numerator given the linear constraints, we arrive at the following bounds:

\[
\frac{1}{p_{11.1}} \max \left\{
\begin{array}{c}
0 \\
p_{11.1} - p_{00.0} - p_{11.0} \\
p_{10.0} - p_{01.1} - p_{10.1}
\end{array}
\right\}
\le P(y_1^* \mid d_0^*, z_1, d_1, y_1) \le
\frac{1}{p_{11.1}} \min \left\{
\begin{array}{c}
p_{11.1} \\
p_{10.0} + p_{11.0} \\
1 - p_{00.0} - p_{10.1}
\end{array}
\right\}
\]

Substituting the observed distribution \(P(y, d \mid z)\) into these formulas, the following bounds were obtained:

\[
\begin{aligned}
0.00 &\le P(y_1^* \mid z_0^*, z_1, d_1, y_1) \le 0.07 \\
0.00 &\le P(y_1^* \mid d_0^*, z_1, d_1, y_1) \le 0.07
\end{aligned}
\]


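We can write the average causal effects for the sub-population resembling the plaintiff by conditioning the counterfactual probabilities in Eqs. (5.16) and (5.17) on the features of the plaintiff:

\[
\begin{aligned}
\mathrm{ACE}(D \rightarrow Y \mid z_1, d_1, y_1) &= P(y_1^* \mid d_1^*, z_1, d_1, y_1) - P(y_1^* \mid d_0^*, z_1, d_1, y_1) \\
\mathrm{ACE}(Z \rightarrow Y \mid z_1, d_1, y_1) &= P(y_1^* \mid z_1^*, z_1, d_1, y_1) - P(y_1^* \mid z_0^*, z_1, d_1, y_1)
\end{aligned}
\]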


Counterfactual probabilities have the property that if the counterfactual antecedent is implied by the real-world observation, then the probability of the counterfactual consequent is the same as in the real world given the observations:


P(c*|o*,o)  =  P(c  =  c*|o) 


Therefore,

\[
\begin{aligned}
P(y_1^* \mid z_1^*, z_1, d_1, y_1) &= 1.00 \\
P(y_1^* \mid d_1^*, z_1, d_1, y_1) &= 1.00
\end{aligned}
\]

and

\[
\begin{aligned}
0.93 &\le \mathrm{ACE}(D \rightarrow Y \mid z_1, d_1, y_1) \le 1.00 \\
0.93 &\le \mathrm{ACE}(Z \rightarrow Y \mid z_1, d_1, y_1) \le 1.00
\end{aligned}
\]

At least 93% of the people in the plaintiff’s subpopulation would not have developed ulcers had they not been encouraged to take PeptAid (\(z_0\)) or, similarly, had they not taken PeptAid (\(d_0\)). This lends very strong support to the plaintiff’s claim that he was adversely affected by the marketer and producer’s actions and product.
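
The plaintiff-specific figures can be checked by substituting the observed distribution into the two symbolic bounds for the counterfactual probabilities. The sketch below is only such a substitution; the function names are illustrative, and the sub-population effect is computed as one minus the counterfactual probability, using the value 1.00 derived above for the factual antecedent.

```python
# P[(y, d, z)] = P(y, d | z), copied from the observed distribution in the text.
P = {
    (0, 0, 0): 0.32, (0, 1, 0): 0.32, (1, 0, 0): 0.04, (1, 1, 0): 0.32,
    (0, 0, 1): 0.02, (0, 1, 1): 0.17, (1, 0, 1): 0.67, (1, 1, 1): 0.14,
}


def ulcer_if_sample_not_received(p):
    """Bounds on P(y1* | z0*, z1, d1, y1): ulcer had the sample not been mailed."""
    p11_1 = p[1, 1, 1]
    lower = max(
        0,
        p11_1 - p[0, 0, 0],
        p[1, 1, 0] - p[0, 0, 1] - p[1, 0, 1],
        p[1, 0, 0] - p[0, 1, 1] - p[1, 0, 1],
    )
    upper = min(p11_1, p[1, 0, 0] + p[1, 1, 0], 1 - p[0, 0, 0] - p[1, 0, 1])
    return lower / p11_1, upper / p11_1


def ulcer_if_not_consumed(p):
    """Bounds on P(y1* | d0*, z1, d1, y1): ulcer had PeptAid not been consumed."""
    p11_1 = p[1, 1, 1]
    lower = max(
        0,
        p11_1 - p[0, 0, 0] - p[1, 1, 0],
        p[1, 0, 0] - p[0, 1, 1] - p[1, 0, 1],
    )
    upper = min(p11_1, p[1, 0, 0] + p[1, 1, 0], 1 - p[0, 0, 0] - p[1, 0, 1])
    return lower / p11_1, upper / p11_1


def subpopulation_ace(counterfactual_bounds):
    """ACE for people like the plaintiff: 1.00 minus the counterfactual probability."""
    lower, upper = counterfactual_bounds
    return 1.0 - upper, 1.0 - lower


if __name__ == "__main__":
    print([round(b, 2) for b in ulcer_if_sample_not_received(P)])
    print([round(b, 2) for b in ulcer_if_not_consumed(P)])
    print([round(b, 2) for b in subpopulation_ace(ulcer_if_not_consumed(P))])
```

In both cases the binding upper term is \(1 - p_{00.0} - p_{10.1} = 0.01\), which divided by \(p_{11.1} = 0.14\) gives the 7% figure.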

The judge ruled in favor of the plaintiff. PeptAid withdrew the product from the market, and initiated a research effort to identify observable characteristics of those individuals who are adversely affected by PeptAid.

One might be curious about the distribution of consumption and response behaviors, \(P(r_d, r_y)\), that would be responsible for such a peculiar story. Given the distribution over the observables {Z, D, Y}, we can evaluate bounds on each individual combination of consumption and response behaviors, leading to the identification of four common behaviors in the population:

\[
\begin{aligned}
0.16 &\le q_{10} \le 0.17 \\
0.13 &\le q_{11} \le 0.14 \\
0.31 &\le q_{22} \le 0.32 \\
0.31 &\le q_{23} \le 0.32
\end{aligned}
\]

Essentially, this tells us that about 1/3 of the population consists of individuals who would consume PeptAid (\(d_1\)) if and only if they received a sample in the mail (\(z_1\)). Of this sub-population (\(r_{d1}\)), about half of them would never develop ulcers (\(r_{y0}\)), while the other half would develop ulcers (\(y_1\)) if and only if they consumed PeptAid (\(r_{y1}\)).

The other 2/3 of the population consists of individuals who would consume PeptAid if and only if they do not receive a sample in the mail. Of this sub-population (\(r_{d2}\)), about half of them develop ulcers if and only if they do not consume PeptAid (\(r_{y2}\)), while the other half would always develop ulcers (\(r_{y3}\)).
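
To see how such a mixture of behaviors generates the observed data, the sketch below pushes one joint distribution over consumption and response behaviors through the response-function model. The particular values in `Q_EXAMPLE` are a hypothetical assignment chosen to lie inside the bounds above; they are not taken from the study, and many other assignments are equally consistent with the data. The encodings of the two behaviors not described above (never consume for index 0, always consume for index 3) are assumptions.

```python
from itertools import product


def consumption(r_d, z):
    """Consumption d for behavior r_d: 0 never, 1 iff sample, 2 iff no sample, 3 always."""
    return (0, z, 1 - z, 1)[r_d]


def ulcer(r_y, d):
    """Outcome y for behavior r_y: 0 never, 1 iff consumed, 2 iff not consumed, 3 always."""
    return (0, d, 1 - d, 1)[r_y]


# Hypothetical q[(r_d, r_y)] = P(r_d, r_y); unlisted combinations have probability zero.
Q_EXAMPLE = {
    (0, 0): 0.01, (0, 3): 0.04,
    (1, 0): 0.17, (1, 1): 0.14,
    (2, 0): 0.01, (2, 2): 0.31, (2, 3): 0.32,
}


def observed_distribution(q):
    """Return p[(y, d, z)] = P(y, d | z) implied by the behavior distribution q."""
    p = {(y, d, z): 0.0 for y, d, z in product((0, 1), repeat=3)}
    for (r_d, r_y), mass in q.items():
        for z in (0, 1):
            d = consumption(r_d, z)
            p[(ulcer(r_y, d), d, z)] += mass
    return p


if __name__ == "__main__":
    for key, value in sorted(observed_distribution(Q_EXAMPLE).items()):
        print(key, round(value, 2))
```

Under this assignment the plaintiff's sub-population (sample received, PeptAid consumed, ulcer developed) consists entirely of individuals with behavior \(q_{11}\), none of whom would have developed ulcers without PeptAid, which is consistent with the 0 to 7% bounds.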

CHAPTER 8


Policy  Analysis  in  Linear  Models 

8.1  Introduction 

Counterfactual  thinking  dominates  reasoning  in  political  science  and  economics. 
We  say,  for  example,  “If  Germany  were  not  punished  so  severely  at  the  end  of 
World  War  I,  Hitler  would  not  have  come  to  power,”  or  “If  Reagan  did  not  lower 
taxes,  our  deficit  would  be  lower  today.”  Such  thought  experiments  emphasize 
an  understanding  of  generic  laws  in  the  domain  and  are  aimed  toward  shaping 
future  policy  making,  for  example,  “defeated  countries  should  not  be  humiliated,” 
or  “lowering  taxes  (contrary  to  Reaganomics)  tends  to  increase  national  debt.” 

Strangely, there is very little formal work on counterfactual reasoning or policy analysis in the behavioral science literature. An examination of a number of econometric journals and textbooks, for example, reveals an imbalance: while an enormous mathematical machinery is brought to bear on problems of estimation and prediction, policy analysis (which is the ultimate goal of economic theories) receives almost no formal treatment. Currently, the most popular methods driving economic policy making are based on so-called reduced-form analysis: to find the impact of a policy involving decision variables X on outcome variables Y, one examines past data and estimates the conditional expectation \(E(Y \mid X = x)\), where x is the particular instantiation of X under the policy studied.

The assumption underlying this method is that the data were generated under circumstances in which the decision variables X act as exogenous variables, that is, variables whose values are determined outside the system under analysis. However, while new decisions should indeed be considered exogenous for the purpose of evaluation, past decisions are rarely enacted in an exogenous manner.^1 Almost every realistic policy (e.g., taxation) imposes control over some endogenous variables, that is, variables whose values are determined by other variables in the analysis. Let us take taxation policies as an example. Economic data are generated in a world in which the government is reacting to various indicators and various pressures; hence, taxation is endogenous in the data-analysis phase of the study. Taxation becomes exogenous when we wish to predict the impact of a specific decision to raise or lower taxes. The reduced-form method is valid only when past decisions are nonresponsive to other variables in the system, and this, unfortunately, eliminates most of the interesting control variables (e.g., tax rates, interest rates, quotas) from the analysis.^2

This difficulty is not unique to economic or social policy making; it appears whenever one wishes to evaluate the merit of a plan on the basis of the past performance of other agents. Even when the signals triggering the past actions of those agents are known with certainty, a systematic method must be devised for selectively ignoring the influence of those signals from the evaluation process. In fact, the very essence of evaluation is having the freedom to imagine and compare trajectories in various counterfactual worlds, where each world or trajectory is created by a hypothetical implementation of a policy that is free of the very pressures that compelled the implementation of such policies in the past.

This chapter will present an example of counterfactual analysis in the area of econometrics, where apparently no adequate formalism for dealing with policy analysis has been proposed. In contrast to reduced-form analysis, our method allows evaluation of the consequences of intervening on economic attributes that are endogenous in normal operation only to become exogenous for the purpose of evaluation. The general techniques developed in Section 2.8.2 will be demonstrated in Section 8.2 by evaluating the effect on the demand for some commodity when a government imposes price controls on that commodity for the first time.

^1 This distinction is often blurred in the literature. [DS93], for example, state: “A variable is considered exogenous to a system if its value is determined outside the system, either because we can control its value externally (e.g., the amount of taxes in a macro-economic model) or because we believe that this variable is controlled externally (like the weather in a system describing crop yields, market prices, etc.)” Still, our ability to externally control the value of a variable X does not render X exogenous for the purpose of legitimizing the reduced form analysis: for \(E[Y \mid X = x]\) to represent the impact of X = x on Y, X must also be independent of all implicit factors (disturbance terms) affecting Y.

While every economist knows that this disturbance-independence is a necessary condition for consistent estimation of structural parameters, most economists assume that disturbance-independence is a guaranteed property of controllable policy variables. A popular textbook [Int78], for example, mentions these two properties as if they were synonymous: “The exogenous variables are variables the values for which are determined outside the model but which influence the model. From a formal standpoint the exogenous variables are assumed to be statistically independent of all stochastic disturbance terms of the model, while the endogenous variables are not statistically independent of those terms. ... In general the exogenous variables are either historically given, policy variables, or determined by some separate mechanism.”

^2 This problem is unrelated to the celebrated Lucas’s critique [Luc76] which concerns parameter changes due to economic agents becoming aware of interventions. The failure of reduced-form analysis extends to physical systems as well, where there are no rational agents to speak of, and where system parameters remain unaltered (except those under direct control).


8.2  Example 

Consider an econometric structural equation model described in [Gol92]:

\[
q = b_1 p + d_1 i + u_1 \tag{8.1}
\]

\[
p = b_2 q + d_2 w + u_2 \tag{8.2}
\]

where q = the quantity of household demand for product A, p = unit price of product A, i = household income, w = wage rate for producing product A, \(u_1\) = demand shock, and \(u_2\) = supply shock.

We extend this model by incorporating an additional variable r = household demand for some substitute product B, along with its structural equation

\[
r = b_3 p + u_3
\]

As an example, B could stand for tea and A for coffee.

Consider  the  following  set  of  counterfactual  queries: 

1. What would be the expectation of demand for coffee (q) had we intervened to force coffee prices (p) to some predetermined value, say p = 7?

2. What would be the expectation of demand for coffee (q) had we intervened to force coffee prices (p) to some predetermined value, say p = 7, and then observed the demand for tea (r) to be r = 4?

3. Given that presently the demand for tea (r) is r = 4, what would be the expectation of demand for coffee (q) had we intervened to force coffee prices (p) to some predetermined value, say p = 7?

Note  the  difference  between  queries  number  2  and  3.  Number  2  states  that  the 
price  intervention  occurs  prior  to  our  observation  of  Product  B’s  demand,  while 
number  3  states  that  we  first  make  an  observation  of  Product  B’s  demand  and 
then  intervene  to  force  Product  A’s  price. 

The above counterfactual queries only involve the variables X = [P, Q, R]; therefore, we may marginalize out all remaining variables in Eqs. (8.1) and (8.2), only retaining the distributions on the disturbance terms of P, Q, and R. Because I and W are exogenous (root) variables in the structural equations, we may combine I and \(U_1\) into one disturbance variable \(\epsilon_q\). Likewise, W and \(U_2\) may be combined into one disturbance variable \(\epsilon_p\). The structural equations for analyzing the above counterfactual queries may be reduced to

\[
x = Bx + \epsilon, \qquad
\begin{bmatrix} p \\ q \\ r \end{bmatrix}
=
\begin{bmatrix} 0 & b_2 & 0 \\ b_1 & 0 & 0 \\ b_3 & 0 & 0 \end{bmatrix}
\begin{bmatrix} p \\ q \\ r \end{bmatrix}
+
\begin{bmatrix} \epsilon_p \\ \epsilon_q \\ \epsilon_r \end{bmatrix}
\tag{8.3}
\]

The causal structure for this model is shown in Figure 8.1.

[Figure 8.1: diagram not reproduced; recoverable node labels: Product A Demand, Product B Demand.]

Figure 8.1: Causal structure of an econometric model relating the demand for two products A and B and the price of product A. The variables are related according to the linear structural equations given in Eq. 8.3, where the disturbances, \(\epsilon_p\), \(\epsilon_q\), and \(\epsilon_r\) are independent and normally distributed.

Because R and Q are d-separated ([Pea88]) by P when the arrow Q → P is removed, the observation of R after P’s intervention has no impact on the evaluation of Q’s distribution. Therefore, the counterfactual distribution of demand for coffee (Q) will be the same for queries number 1 and 2.

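Suppose that the parameters for this model are given by:

\[
B = \begin{bmatrix} 0 & 0.50 & 0 \\ -1.80 & 0 & 0 \\ 1.00 & 0 & 0 \end{bmatrix}
\qquad
\mu_{\epsilon} = \begin{bmatrix} 0 & 19.00 & 3.00 \end{bmatrix}
\qquad
\Sigma_{\epsilon,\epsilon} = \begin{bmatrix} 1.00 & 0 & 0 \\ 0 & 3.00 & 0 \\ 0 & 0 & 2.00 \end{bmatrix}
\]

which reflects the following prior distribution on X = [P, Q, R]:

\[
\mu_{x} = \begin{bmatrix} 5.00 & 10.00 & 8.00 \end{bmatrix}
\qquad
\Sigma_{x,x} = \begin{bmatrix} 0.48 & -0.08 & 0.48 \\ -0.08 & 1.73 & -0.08 \\ 0.48 & -0.08 & 2.48 \end{bmatrix}
\]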

The  expected  price  of  coffee  is  $5.00,  while  the  average  demands  for  coffee  and 
tea  are  10  and  8  units,  respectively. 
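
The prior moments follow from solving the structural equations for the variables in terms of the disturbances. The sketch below does this by hand for the three-variable model rather than inverting a matrix; the names `COEF`, `mean`, and `cov` are illustrative and do not come from Section 2.8.2.

```python
# Structural coefficients read off the matrix B, and disturbance moments from the text.
B1, B2, B3 = -1.80, 0.50, 1.00      # q = B1*p + eps_q,  p = B2*q + eps_p,  r = B3*p + eps_r
MU_EPS = (0.0, 19.0, 3.0)           # means of (eps_p, eps_q, eps_r)
VAR_EPS = (1.0, 3.0, 2.0)           # variances; the disturbances are independent

# Solving the two simultaneous equations for p and q gives the reduced form below:
# each variable is a linear combination of (eps_p, eps_q, eps_r).
K = 1.0 - B1 * B2
COEF = {
    "p": (1.0 / K, B2 / K, 0.0),
    "q": (B1 / K, 1.0 / K, 0.0),
    "r": (B3 / K, B3 * B2 / K, 1.0),
}


def mean(v):
    """Prior mean of variable v ('p', 'q' or 'r')."""
    return sum(c * m for c, m in zip(COEF[v], MU_EPS))


def cov(u, v):
    """Prior covariance between variables u and v."""
    return sum(cu * cv * var for cu, cv, var in zip(COEF[u], COEF[v], VAR_EPS))


if __name__ == "__main__":
    print([round(mean(v), 2) for v in "pqr"])
    for u in "pqr":
        print([round(cov(u, v), 2) for v in "pqr"])
```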

The first query above asks for the distribution of demand for coffee (Q), given that no observations have been made on the system, if we had intervened to force the price of coffee to $7.00. Evaluating the expressions in Eqs. (2.14)-(2.15), we arrive at the following distribution:

\[
\mu_{x^* \mid p=7} = \begin{bmatrix} 7.00 & 6.40 & 10.00 \end{bmatrix} \tag{8.4}
\]

\[
\Sigma_{x^* \mid p=7} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 3.00 & 0 \\ 0 & 0 & 2.00 \end{bmatrix}
\]

We conclude that the average household demand for coffee and tea would have been 6.4 and 10 units, respectively, if the price of coffee were $7.00.

The third question asks what would have been the distribution of demand for coffee (Q), if the price of coffee were controlled to $7.00, given that demand for tea is currently 4 units. Applying the expressions in Eqs. (2.14)-(2.15):

\[
\mu_{x^* \mid p=7,\, r=4} = \begin{bmatrix} 7.00 & 5.13 & 6.78 \end{bmatrix} \tag{8.5}
\]

\[
\Sigma_{x^* \mid p=7,\, r=4} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 2.75 & -0.64 \\ 0 & -0.64 & 0.39 \end{bmatrix}
\]

Note the importance of the observation of demand for tea (R). In the first query we found that forcing the price of coffee (P) to $7.00 will reduce the expected demand for coffee (Q) from 10 units to 6.4 units. The observation of a 4-unit demand for tea changes our belief in the expected value of the demand for coffee to \(\mu_{q \mid r=4} = 10.13\) units; if we intervene to force the price of coffee to $7.00, the expected demand for coffee (Q) will be reduced from 10.13 to 5.13 units. Therefore, we see that enforcing price control on coffee would have had a more adverse effect on the demand for coffee under the knowledge that the demand for tea was only 4 units. In addition, the expected household demand for tea would have been 6.78 units rather than the observed 4 units.
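
Both counterfactual results can be reproduced with a three-step computation: update the disturbance distribution with the observation, replace the price equation by the forced value, and propagate the updated disturbances through the remaining equations. The sketch below is one way to organize that computation for this model; it is not a transcription of Eqs. (2.14)-(2.15), and the function names are illustrative.

```python
# Structural coefficients and disturbance moments copied from the text.
B1, B2, B3 = -1.80, 0.50, 1.00      # q = B1*p + eps_q,  p = B2*q + eps_p,  r = B3*p + eps_r
MU_EPS = [0.0, 19.0, 3.0]           # means of (eps_p, eps_q, eps_r)
VAR_EPS = [1.0, 3.0, 2.0]           # variances; the disturbances are independent

# Before any intervention, r is this linear combination of the disturbances.
K = 1.0 - B1 * B2
R_COEF = [B3 / K, B3 * B2 / K, 1.0]


def abduct(r_obs):
    """Posterior mean and covariance of the disturbances given the observation R = r_obs."""
    r_mean = sum(a * m for a, m in zip(R_COEF, MU_EPS))
    r_var = sum(a * a * v for a, v in zip(R_COEF, VAR_EPS))
    gain = [a * v / r_var for a, v in zip(R_COEF, VAR_EPS)]    # Cov(eps_i, r) / Var(r)
    post_mean = [m + g * (r_obs - r_mean) for m, g in zip(MU_EPS, gain)]
    post_cov = [[(VAR_EPS[i] if i == j else 0.0) - gain[i] * R_COEF[j] * VAR_EPS[j]
                 for j in range(3)] for i in range(3)]
    return post_mean, post_cov


def force_price(price, eps_mean, eps_cov):
    """Moments of (q, r) once the price equation is replaced by p = price."""
    return {
        "mean_q": B1 * price + eps_mean[1],
        "mean_r": B3 * price + eps_mean[2],
        "var_q": eps_cov[1][1],
        "cov_qr": eps_cov[1][2],
        "var_r": eps_cov[2][2],
    }


PRIOR_COV = [[VAR_EPS[i] if i == j else 0.0 for j in range(3)] for i in range(3)]
QUERY_1 = force_price(7.0, MU_EPS, PRIOR_COV)      # no observation, then force p = 7
QUERY_3 = force_price(7.0, *abduct(4.0))           # observe r = 4, then force p = 7

if __name__ == "__main__":
    print({k: round(v, 2) for k, v in QUERY_1.items()})
    print({k: round(v, 2) for k, v in QUERY_3.items()})
```

The observation of tea demand matters only through the disturbances: a low demand for tea lowers the posterior means of \(\epsilon_q\) and \(\epsilon_r\), and those lowered values carry over into the world where the price is forced.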

If we believe that the disturbances on the demand for coffee (\(\epsilon_q\)) change slowly, or at least change infrequently, then we can use the results of this counterfactual distribution to determine whether price controls should now be imposed to meet our needs. In other words, the counterfactual distribution will tell us how we expect variables’ distributions to change as a result of an external intervention applied in the present.

It is important to note the difference between counterfactual distributions (conditioned on observations and external intervention) and distributions simply conditioned on observations. Consider the distribution that would be computed from observing the price of coffee at $7.00 (p = 7), or from observing the demand for tea at 4 units and the coffee price at $7.00 (r = 4 and p = 7):

\[
\mu_{x \mid p=7} = \begin{bmatrix} 7.00 & 9.66 & 10.00 \end{bmatrix} \tag{8.6}
\]

\[
\Sigma_{x,x \mid p=7} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1.71 & 0 \\ 0 & 0 & 2.00 \end{bmatrix} \tag{8.7}
\]

\[
\mu_{x \mid r=4,\, p=7} = \begin{bmatrix} 7.00 & 9.66 & 4.00 \end{bmatrix} \tag{8.8}
\]

\[
\Sigma_{x,x \mid r=4,\, p=7} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1.71 & 0 \\ 0 & 0 & 0 \end{bmatrix} \tag{8.9}
\]

Contrast the expected household demand for coffee evaluated from these conditional distributions with those counterfactual distributions where the price of coffee (P) has been forced by external intervention. In particular, compare Eq. (8.6) to Eq. (8.4) and Eq. (8.8) to Eq. (8.5). This should convince the reader that it is incorrect to use distributions conditioned on observations for evaluating (economic) policies, because they fail to capture the change in value of the variable that will undergo external intervention. The expected value of that variable prior to intervention is important for properly evaluating the effect of that intervention.
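
The contrast can be made concrete by computing the expected demand for coffee three ways: conditioning on an observed price, conditioning on an observed price and tea demand, and forcing the price. The sketch below uses ordinary Gaussian conditioning on the prior moments; the function names are illustrative, and the forced-price case uses only the first query (no prior observation).

```python
# Structural coefficients and disturbance moments copied from the text.
B1, B2, B3 = -1.80, 0.50, 1.00
MU_EPS = (0.0, 19.0, 3.0)
VAR_EPS = (1.0, 3.0, 2.0)

# Reduced form: each variable as a combination of (eps_p, eps_q, eps_r).
K = 1.0 - B1 * B2
COEF = {
    "p": (1.0 / K, B2 / K, 0.0),
    "q": (B1 / K, 1.0 / K, 0.0),
    "r": (B3 / K, B3 * B2 / K, 1.0),
}


def mean(v):
    return sum(c * m for c, m in zip(COEF[v], MU_EPS))


def cov(u, v):
    return sum(cu * cv * var for cu, cv, var in zip(COEF[u], COEF[v], VAR_EPS))


def coffee_demand_given_observed_price(p_obs):
    """E[q | p = p_obs]: the price is merely observed."""
    return mean("q") + cov("q", "p") / cov("p", "p") * (p_obs - mean("p"))


def coffee_demand_given_observed_price_and_tea(p_obs, r_obs):
    """E[q | p = p_obs, r = r_obs], using the inverse of the 2x2 covariance of (p, r)."""
    s_pp, s_pr, s_rr = cov("p", "p"), cov("p", "r"), cov("r", "r")
    det = s_pp * s_rr - s_pr * s_pr
    c_p, c_r = cov("q", "p"), cov("q", "r")
    w_p = (c_p * s_rr - c_r * s_pr) / det
    w_r = (c_r * s_pp - c_p * s_pr) / det
    return mean("q") + w_p * (p_obs - mean("p")) + w_r * (r_obs - mean("r"))


def coffee_demand_given_forced_price(price):
    """Expected q when the price equation is replaced by p = price."""
    return B1 * price + MU_EPS[1]


if __name__ == "__main__":
    print(round(coffee_demand_given_observed_price(7.0), 2))
    print(round(coffee_demand_given_observed_price_and_tea(7.0, 4.0), 2))
    print(round(coffee_demand_given_forced_price(7.0), 2))
```

Observing a high price mostly signals a favorable demand disturbance, so the conditional expectation barely moves, whereas forcing the price acts on demand through the coefficient \(b_1\) alone.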

8.3 Conclusion


This chapter has addressed the inadequacy of current techniques in econometrics and the social sciences for evaluating the potential effects of economic and social policies. Current techniques fail to correctly evaluate policies that control endogenous variables, that is, variables that are influenced by other variables in the system prior to enacting the policy.

This deficiency has been addressed by applying the formalism for evaluating counterfactual conditionals in linear structural equation models described in Section 2.8.2. This method is applicable to the analysis of policies, even when the policy dictates intervention on an endogenous variable. An example was presented that demonstrates the disparity between analyses based on counterfactuals and reduced-form analysis which treats intervention as an observation on controlled variables.

CHAPTER 9


Conclusion 


Counterfactual  reasoning  is  common  to  everyday  discourse,  and  is  important  to 
a  broad  range  of  applications  including  liability  litigation  and  policy  analysis. 
Although  closest-world  semantics  and  imaging  provide  a  solid  foundation  for 
analyzing  counterfactual  conditionals,  a  formalization  of  closeness  of  worlds  that 
intuitively  reflects  our  understanding  of  the  mechanisms  that  drive  the  world  has 
not  previously  been  provided.  This  dissertation  has  addressed  this  short-coming 
by  representing  generic  knowledge  of  the  world  by  causal  relationships  and  by 
interpreting  a  counterfactual  antecedent  as  an  external  intervention  that  forces 
the  antecedent  to  be  true  despite  all  known  influences  that  normally  impinge  on 
the  antecedent  variable. 

Under this formulation, the ability to precisely evaluate a counterfactual probability (i.e., the probability that the consequent would have been true, if the antecedent were true) is dependent on the detail of causal knowledge available. While a counterfactual probability may be uniquely computed given a functional model of a system, only bounds on the counterfactual probability may be computed if the causal relationships are parameterized by a probabilistic specification (i.e., a conditional probability distribution for each variable given an instantiation of its causal influences). Depending on the form of a counterfactual query and the causal structure of the system, it is not always possible to guarantee the evaluation of bounds on a counterfactual probability. However, it has been shown in this dissertation that the evaluation of bounds is guaranteed for counterfactual beliefs when the causal model is parameterized by order-of-magnitude probabilities.

Our formulation for interpreting and evaluating counterfactual probabilities has been applied to the determination of bounds on treatment effects from studies in which subject compliance is imperfect, resulting in tighter bounds than previously discovered. These results are based on a large-sample approximation; in the future, we would like to explore the small-sample analysis through hypothesis testing and computing the distribution of bounds.

We have also demonstrated the potential power of counterfactual probabilities for determining liability in legal cases when a causal formulation may be brought to bear in the case. Finally, economic and social policy analysis can also benefit from evaluating counterfactuals in structural equation models, which allows analysts to determine the effect of controlling endogenous variables in a system.

APPENDIX A


Proofs 

A.1 Sufficiency of P space constraints

In  this  section,  we  will  prove  that  any  distribution  in  observation  space,  P,  which 
satisfies  the  constraints  given  in  equation  5.13  can  be  modeled  by  the  latent 
structure  given  in  figure  5.1. 

The  key  to  the  proof  is  to  show  that  there  is  a  one-to-one  mapping  between 
the  extreme  points  in  the  observation  space  constrained  by  equation  5.13  and  the 
extreme  points  in  a  transformed  parameter  space  of  the  counterfactual  model. 
This  can  easily  be  accomplished  by  using  an  algorithm  for  enumerating  all  vertices 
in  a  convex  polytope. 

Theorem A.1.1 [Sufficiency of P space constraints] Satisfaction of the constraints:

\[
\begin{aligned}
p_{11.1} + p_{01.0} &\le 1 \\
p_{01.1} + p_{11.0} &\le 1 \\
p_{10.1} + p_{00.0} &\le 1 \\
p_{00.1} + p_{10.0} &\le 1
\end{aligned}
\]

is sufficient to guarantee that the latent structure in figure 5.1 can model a point in the probabilistic observation space P.

Proof: 

The full set of linear constraints (including those imposed by probability theory) which define the above P space is given by

\[
\begin{aligned}
p_{11.1} + p_{01.0} &\le 1 \\
p_{01.1} + p_{11.0} &\le 1 \\
p_{10.1} + p_{00.0} &\le 1 \\
p_{00.1} + p_{10.0} &\le 1 \\
p_{00.1} + p_{01.1} + p_{10.1} + p_{11.1} &= 1 \\
p_{00.0} + p_{01.0} + p_{10.0} + p_{11.0} &= 1 \\
p_{00.0},\, p_{01.0},\, p_{10.0},\, p_{11.0},\, p_{00.1},\, p_{01.1},\, p_{10.1},\, p_{11.1} &\ge 0
\end{aligned}
\]

The extreme vertices within this closed polytope may be enumerated by one of many vertex enumeration algorithms (for example, [Mat73]):

\[
\begin{aligned}
p_1 &= (1,0,0,0,1,0,0,0) \\
p_2 &= (1,0,0,0,0,1,0,0) \\
p_3 &= (1,0,0,0,0,0,0,1) \\
p_4 &= (0,1,0,0,1,0,0,0) \\
p_5 &= (0,1,0,0,0,1,0,0) \\
p_6 &= (0,1,0,0,0,0,1,0) \\
p_7 &= (0,0,1,0,0,1,0,0) \\
p_8 &= (0,0,1,0,0,0,1,0) \\
p_9 &= (0,0,1,0,0,0,0,1) \\
p_{10} &= (0,0,0,1,1,0,0,0) \\
p_{11} &= (0,0,0,1,0,0,1,0) \\
p_{12} &= (0,0,0,1,0,0,0,1)
\end{aligned}
\]

where

\[
p = (p_{00.0},\, p_{01.0},\, p_{10.0},\, p_{11.0},\, p_{00.1},\, p_{01.1},\, p_{10.1},\, p_{11.1})
\]

The transformation from Q space to P space (Eq. 5.25) was explicated in Section 5.2.2. There are four pairs of Q space parameters each of which always occurs in combination within this transformation; therefore we can reduce the Q space to a 12 dimensional space, V, where V and Q are related as follows:

\[
\begin{aligned}
v_1 &= q_{00} + q_{01} \\
v_2 &= q_{02} + q_{03} \\
v_3 &= q_{10} \\
v_4 &= q_{11} \\
v_5 &= q_{12} \\
v_6 &= q_{13} \\
v_7 &= q_{20} \\
v_8 &= q_{21} \\
v_9 &= q_{22} \\
v_{10} &= q_{23} \\
v_{11} &= q_{30} + q_{32} \\
v_{12} &= q_{31} + q_{33}
\end{aligned}
\tag{A.1}
\]

The  V  and  P  spaces  are  then  related  by  the  following  equations: 

Poo.o  =  Ui  +  1^3  +  ^4 
Poi.o  =  V7  +  Vg  +  Vii 
PlO.O  =  1^2  +  V5  +  ^6 

Pll.O  =  ^8  +  Vio  +  V12 

(A.l) 

Poo.i  —  ui  +  U7  +  ug  (A-2) 

Pol.l  =  Vs  +  ^5  +  Vll 
PlO)l  =  U2  +  U9  +  1^10 
Pii.i  =  V4  +  ve  +  V12 

If we constrain V by probability theory ($\sum_{i=1}^{12} v_i = 1$ and $v_i \ge 0$, $i = 1, \ldots, 12$) we obtain twelve extreme vertices corresponding to the points where exactly one of the $v_i$ is equal to 1.0 and all others are zero.

Eq. (A.2) provides a one-to-one mapping between these twelve V space vertices and the twelve vertices in the constrained P space. Every point p in the constrained P space is a convex combination of the twelve P space vertices; because the transformation is linear, the same convex weights applied to the corresponding V space vertices give a point in V (and hence Q) space that models p. Since the latent structure model (Figure 5.1) subsumes the response-function model, any point in P which satisfies the constraints given by equation 5.13 can be modeled using the latent structure.

□ 
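
The vertex correspondence used in the proof can be checked mechanically. The following Python sketch is only an illustration: it does not run a vertex-enumeration algorithm such as [Mat73], but takes the 0/1 points of P space as candidate vertices, filters them by the four constraints of the theorem, and compares them with the images of the twelve V space vertices under Eq. (A.2). The names V_TO_P, image_of_v_vertex, satisfies_constraints and deterministic_p_points are hypothetical.

```python
from itertools import product

# Coordinate order of P space, as in the proof:
# (p00.0, p01.0, p10.0, p11.0, p00.1, p01.1, p10.1, p11.1)
# Each entry lists the indices i of the v_i that are summed for that coordinate (Eq. A.2).
V_TO_P = [
    (1, 3, 4),    # p00.0
    (7, 9, 11),   # p01.0
    (2, 5, 6),    # p10.0
    (8, 10, 12),  # p11.0
    (1, 7, 8),    # p00.1
    (3, 5, 11),   # p01.1
    (2, 9, 10),   # p10.1
    (4, 6, 12),   # p11.1
]


def image_of_v_vertex(i):
    """Map the V space vertex with v_i = 1 (all other v_j = 0) into P space."""
    return tuple(1 if i in terms else 0 for terms in V_TO_P)


def satisfies_constraints(p):
    """The four inequality constraints of Theorem A.1.1."""
    p00_0, p01_0, p10_0, p11_0, p00_1, p01_1, p10_1, p11_1 = p
    return (p11_1 + p01_0 <= 1 and p01_1 + p11_0 <= 1
            and p10_1 + p00_0 <= 1 and p00_1 + p10_0 <= 1)


def deterministic_p_points():
    """All 0/1 points with exactly one 1 among the z=0 and one among the z=1 coordinates."""
    points = []
    for a, b in product(range(4), repeat=2):
        p = [0] * 8
        p[a] = 1
        p[4 + b] = 1
        points.append(tuple(p))
    return points


images = [image_of_v_vertex(i) for i in range(1, 13)]
p_vertices = [p for p in deterministic_p_points() if satisfies_constraints(p)]

print(len(set(images)), len(p_vertices))
print(set(images) == set(p_vertices))
```

Of the sixteen deterministic points, four violate one of the inequalities, and the remaining twelve coincide with the images of the V space vertices.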
APPENDIX B


Closed-form  solutions  to  linear  optimization 


In general, a linear optimization problem may be specified by an objective function to minimize

\[
\min\; c^t x \qquad (B.1)
\]

along with a set of linear constraints that must be satisfied:

\[
A x \ge b \qquad (B.2)
\]

\[
x \ge 0 \qquad (B.3)
\]

where A is a matrix of coefficients, c is a vector of coefficients acting on the variable vector x, and b is a vector of constants. Given A, c, and b, there are many algorithms that will return a value for the vector x that globally minimizes $c^t x$ subject to the specified linear constraints [Had62] [Dan63].

Sometimes, though, it is desirable to derive a closed-form expression for $\min c^t x$ for all possible values of the constraint vector b. The procedure for deriving this closed-form solution is tied to the enumeration of all extreme vertices in the dual linear-programming problem.

The dual of the above minimization problem is given by the objective function:

\[
\max\; y^t b
\]

subject to the constraints

\[
y^t A \le c^t \qquad (B.4)
\]

\[
y \ge 0 \qquad (B.5)
\]

It is known [Jac93] that the expression for $\min c^t x$ in terms of b is given by

\[
\min\; c^t x = \max_i\; y_i^t b \qquad (B.6)
\]

where $\{y_i \mid i = 0, \ldots, k\}$ is the set of y such that each $y_i$ maximizes $y^t b$ for some value of b. This set of y is exactly the set of extreme vertices in the constraint space given by Eqs. (B.4) and (B.5):

\[
\begin{aligned}
y^t A &\le c^t \\
y &\ge 0
\end{aligned}
\]

Therefore,  to  generate  the  general  solution  to  the  minimization  problem  given 
by  Eqs.  (B.1)-(B.3)  we  merely  enumerate  all  vertices  in  the  constraint  space  of 
the  dual  linear-programming  problem  (Eqs.  (B.4)  and  (B.5))  and  substitute  into 
Eq.  (B.6). 
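
The procedure can be illustrated on a very small problem. The following Python sketch is not the Mattheiss algorithm [Mat73]; it is a brute-force substitute that intersects every choice of m constraints of the dual and keeps the feasible intersection points, which is practical only for tiny problems. The function names and the two-variable example (A_demo, c_demo) are hypothetical, and Eq. (B.6) is applied under the assumption that the primal problem is feasible for the given b.

```python
from fractions import Fraction
from itertools import combinations


def solve(M, rhs):
    """Solve the square system M z = rhs exactly; return None if M is singular."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(M, rhs)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def dual_vertices(A, c):
    """Vertices of {y : y^t A <= c^t, y >= 0}, found by brute force."""
    m, n = len(A), len(c)
    # Every constraint is stored as (row, bound), meaning row . y <= bound.
    cons = [([A[i][j] for i in range(m)], c[j]) for j in range(n)]
    cons += [([-1 if k == i else 0 for k in range(m)], 0) for i in range(m)]
    found = set()
    for tight in combinations(cons, m):
        y = solve([row for row, _ in tight], [bound for _, bound in tight])
        if y is None:
            continue
        if all(sum(r * v for r, v in zip(row, y)) <= bound for row, bound in cons):
            found.add(tuple(y))
    return sorted(found)


def closed_form_min(A, c, b):
    """Evaluate Eq. (B.6): the maximum of y_i^t b over the dual vertices y_i."""
    return max(sum(yi * bi for yi, bi in zip(y, b)) for y in dual_vertices(A, c))


# Hypothetical example: minimize x1 + 2*x2 subject to
# x1 + x2 >= b1, x1 >= b2, x1 >= 0, x2 >= 0.
A_demo = [[1, 1],
          [1, 0]]
c_demo = [1, 2]

print(dual_vertices(A_demo, c_demo))
print(closed_form_min(A_demo, c_demo, [3, 5]))
```

For this example the dual vertices are (0, 0), (0, 1) and (1, 0), so the closed-form minimum is max{0, b1, b2}.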

A  review  of  some  vertex-enumeration  algorithms  may  be  found  in  [MR80], 
of  which  the  algorithm  by  Mattheiss  [Mat73]  was  implemented  to  derive  the 
solutions  presented  in  this  dissertation. 

In order to apply this procedure for deriving a closed-form solution to a linear-optimization problem, the constraints must be transformed to ≥ relations. Many
of  the  constraints  imposed  for  deriving  bounds  on  counterfactual  probabilities 
are  in  the  form  of  equalities.  The  presence  of  equality  constraints  indicates  that 
there  are  fewer  degrees  of  freedom  in  the  problem  space  than  the  number  of 
variables  suggests;  these  equalities  will  be  used  to  eliminate  variables  from  the 
linear-optimization  problem.  For  example,  suppose  that  the  following  constraints 
exist  for  the  two  variables  a  and  b: 

\[
\begin{aligned}
a + b &= 1 \\
a &\ge 0 \\
b &\ge 0
\end{aligned}
\]

and the expression to be optimized is

\[
2a - b
\]

The equality relation allows us to write a in terms of the remaining variable

\[
a = 1 - b
\]

which may then be used to eliminate a from the linear programming problem. The constraints then become

\[
\begin{aligned}
1 - b &\ge 0 \\
b &\ge 0
\end{aligned}
\]

while the objective function becomes

\[
2 - 3b
\]
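
As a quick sanity check of this elimination, the following Python sketch compares the original and the reduced objective on a grid of feasible values of b and locates the minimum. The function names and the grid resolution are hypothetical choices made only for illustration.

```python
from fractions import Fraction


def original_objective(a, b):
    """Objective before elimination: 2a - b."""
    return 2 * a - b


def reduced_objective(b):
    """Objective after substituting a = 1 - b."""
    return 2 - 3 * b


# Values of b satisfying the reduced constraints 1 - b >= 0 and b >= 0.
grid = [Fraction(k, 100) for k in range(101)]

# The two objectives agree whenever a = 1 - b.
agree = all(original_objective(1 - b, b) == reduced_objective(b) for b in grid)

best_b = min(grid, key=reduced_objective)
minimum = reduced_objective(best_b)

print(agree, best_b, minimum)
```

Since the reduced objective is linear and decreasing in b, its minimum over the feasible interval is attained at the endpoint b = 1, where a = 0.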
B.1  Example


Consider  the  derivation  of  bounds  for  average  treatment  effect  in  Section  5.2.2;  the 
objective  function  to  optimize  was  given  by  Eq.  (5.23)  and  the  linear  constraints 
were  given  by  Eq.  (5.26). 

The equality constraints in this specification allow us to eliminate seven of the variables ($q_{00}, q_{10}, q_{20}, q_{11}, q_{21}, q_{02}, q_{12}$), resulting in the following seven non-trivial inequality constraints:

\[
\begin{aligned}
q_{30} - q_{01} + q_{31} + q_{22} + q_{32} + q_{23} + q_{33} &\ge p_{01.0} + p_{11.0} - p_{00.1} \\
-q_{30} - q_{22} - q_{32} + q_{13} - q_{23} &\ge p_{10.0} - p_{01.1} - p_{10.1} \\
-q_{30} - q_{22} - q_{32} &\ge -p_{01.0} \\
-q_{31} - q_{13} - q_{33} &\ge -p_{11.1} \\
-q_{31} - q_{23} - q_{33} &\ge -p_{11.0} \\
-q_{22} - q_{03} - q_{23} &\ge -p_{10.1} \\
q_{22} - q_{13} + q_{23} &\ge p_{10.1} - p_{10.0}
\end{aligned}
\]


The objective function to be minimized, ACE(D → Y), may also be rewritten in terms of the remaining variables:

\[
p_{11.0} + p_{11.1} - p_{10.0} + q_{01} - q_{31} - q_{22} - q_{32} + q_{03} - q_{23} - 2q_{33}
\]

Before this expression is minimized the constant terms ($p_{11.0} + p_{11.1} - p_{10.0}$) will be dropped. After the minimization is complete these terms will be reattached. Therefore, the expression to be optimized by linear programming is given by:

\[
q_{01} - q_{31} - q_{22} - q_{32} + q_{03} - q_{23} - 2q_{33}
\]


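In terms of Eqs. (B.1)-(B.5), this task may be specified by the following matrices.

\[
A = \begin{bmatrix}
 1 & -1 &  1 &  1 &  1 &  0 &  0 &  1 &  1 \\
-1 &  0 &  0 & -1 & -1 &  0 &  1 & -1 &  0 \\
-1 &  0 &  0 & -1 & -1 &  0 &  0 &  0 &  0 \\
 0 &  0 & -1 &  0 &  0 &  0 & -1 &  0 & -1 \\
 0 &  0 & -1 &  0 &  0 &  0 &  0 & -1 & -1 \\
 0 &  0 &  0 & -1 &  0 & -1 &  0 & -1 &  0 \\
 0 &  0 &  0 &  1 &  0 &  0 & -1 &  1 &  0
\end{bmatrix}
\qquad
b = \begin{bmatrix}
p_{01.0} + p_{11.0} - p_{00.1} \\
p_{10.0} - p_{01.1} - p_{10.1} \\
-p_{01.0} \\
-p_{11.1} \\
-p_{11.0} \\
-p_{10.1} \\
p_{10.1} - p_{10.0}
\end{bmatrix}
\]

\[
\begin{aligned}
x^t &= [\; q_{30} \;\; q_{01} \;\; q_{31} \;\; q_{22} \;\; q_{32} \;\; q_{03} \;\; q_{13} \;\; q_{23} \;\; q_{33} \;] \\
c^t &= [\; 0 \;\; 1 \;\; {-1} \;\; {-1} \;\; {-1} \;\; 1 \;\; 0 \;\; {-1} \;\; {-2} \;] \\
y^t &= [\; y_1 \;\; y_2 \;\; y_3 \;\; y_4 \;\; y_5 \;\; y_6 \;\; y_7 \;]
\end{aligned}
\]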

Applying  a  vertex  enumeration  algorithm  to  the  dual  linear-programming 
problem’s  constraint  space  leads  to  the  following  list  of  extreme  vertices: 

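\[
\begin{aligned}
y^t &= [\; 0 \;\; 0 \;\; 1 \;\; 2 \;\; 0 \;\; 1 \;\; 0 \;] \\
y^t &= [\; 0 \;\; 0 \;\; 1 \;\; 0 \;\; 2 \;\; 0 \;\; 0 \;] \\
y^t &= [\; 0 \;\; 1 \;\; 0 \;\; 0 \;\; 2 \;\; 1 \;\; 1 \;] \\
y^t &= [\; 0 \;\; 1 \;\; 0 \;\; 2 \;\; 0 \;\; 0 \;\; 0 \;] \\
y^t &= [\; 0 \;\; 1 \;\; 0 \;\; 1 \;\; 1 \;\; 0 \;\; 0 \;] \\
y^t &= [\; 0 \;\; 0 \;\; 1 \;\; 1 \;\; 1 \;\; 0 \;\; 0 \;] \\
y^t &= [\; 0 \;\; 2 \;\; 0 \;\; 2 \;\; 0 \;\; 0 \;\; 0 \;] \\
y^t &= [\; 0 \;\; 0 \;\; 2 \;\; 0 \;\; 2 \;\; 0 \;\; 1 \;]
\end{aligned}
\]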

Substituting these vertices into Eq. (B.6) and then adding back the previously dropped terms ($p_{11.0} + p_{11.1} - p_{10.0}$) will produce the same results given by Eq. (5.29).
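
The substitution step can be reproduced numerically. The following Python sketch is an illustration only: it checks that the eight listed vectors satisfy Eqs. (B.4) and (B.5), evaluates Eq. (B.6) for one test distribution, and compares the result with the eight-term expression printed by the program in Section B.2.2. The test distribution q (weights 1 to 16, normalized) and all function names are hypothetical; the relation between p and q is copied from the equality constraints of the example input file in Section B.2.1.

```python
from fractions import Fraction

# Columns of A and entries of c follow the order q30 q01 q31 q22 q32 q03 q13 q23 q33.
A = [
    [1, -1, 1, 1, 1, 0, 0, 1, 1],
    [-1, 0, 0, -1, -1, 0, 1, -1, 0],
    [-1, 0, 0, -1, -1, 0, 0, 0, 0],
    [0, 0, -1, 0, 0, 0, -1, 0, -1],
    [0, 0, -1, 0, 0, 0, 0, -1, -1],
    [0, 0, 0, -1, 0, -1, 0, -1, 0],
    [0, 0, 0, 1, 0, 0, -1, 1, 0],
]
c = [0, 1, -1, -1, -1, 1, 0, -1, -2]

# Dual vertices y^t = [y1 ... y7] as listed in the text.
DUAL_VERTICES = [
    (0, 0, 1, 2, 0, 1, 0), (0, 0, 1, 0, 2, 0, 0),
    (0, 1, 0, 0, 2, 1, 1), (0, 1, 0, 2, 0, 0, 0),
    (0, 1, 0, 1, 1, 0, 0), (0, 0, 1, 1, 1, 0, 0),
    (0, 2, 0, 2, 0, 0, 0), (0, 0, 2, 0, 2, 0, 1),
]


def is_dual_feasible(y):
    """Check y >= 0 and y^t A <= c^t (Eqs. B.4 and B.5)."""
    return all(v >= 0 for v in y) and all(
        sum(yi * a for yi, a in zip(y, col)) <= cj for col, cj in zip(zip(*A), c))


def p_from_q(q):
    """Observable probabilities p["yd.z"] as sums of response-function probabilities."""
    s = lambda *keys: sum(q[k] for k in keys)
    return {
        "00.0": s("00", "01", "10", "11"), "01.0": s("20", "22", "30", "32"),
        "10.0": s("02", "03", "12", "13"), "11.0": s("21", "23", "31", "33"),
        "00.1": s("00", "01", "20", "21"), "01.1": s("10", "12", "30", "32"),
        "10.1": s("02", "03", "22", "23"), "11.1": s("11", "13", "31", "33"),
    }


def ace(q):
    """ACE(D -> Y) written in terms of q, as in the OBJECTIVE section."""
    return sum(q[f"{j}1"] - q[f"{j}2"] for j in range(4))


def bound_from_vertices(p):
    """Eq. (B.6) plus the constant terms dropped before the minimization."""
    b = [p["01.0"] + p["11.0"] - p["00.1"], p["10.0"] - p["01.1"] - p["10.1"],
         -p["01.0"], -p["11.1"], -p["11.0"], -p["10.1"], p["10.1"] - p["10.0"]]
    const = p["11.0"] + p["11.1"] - p["10.0"]
    return const + max(sum(yi * bi for yi, bi in zip(y, b)) for y in DUAL_VERTICES)


def bound_closed_form(p):
    """The eight terms of the MAX set printed in Section B.2.2."""
    return max(
        p["11.1"] + p["00.0"] - 1,
        p["11.0"] + p["00.1"] - 1,
        -p["01.1"] - p["10.1"],
        -p["01.0"] - p["10.0"],
        p["11.0"] - p["11.1"] - p["10.1"] - p["01.0"] - p["10.0"],
        p["11.1"] - p["11.0"] - p["10.0"] - p["01.1"] - p["10.1"],
        p["00.1"] - p["01.1"] - p["10.1"] - p["01.0"] - p["00.0"],
        p["00.0"] - p["01.0"] - p["10.0"] - p["01.1"] - p["00.1"],
    )


# Hypothetical test distribution: weights 1..16, normalized to sum to one.
q = {f"{j}{k}": Fraction(4 * j + k + 1, 136) for j in range(4) for k in range(4)}
p = p_from_q(q)
print(bound_from_vertices(p), bound_closed_form(p), ace(q))
```

For this test distribution the two forms of the lower bound coincide and lie below the ACE of the generating distribution, as weak duality requires.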


B.2  Program  Implementation 

At the UCLA Cognitive Systems Laboratory, we have implemented a program for deriving closed-form solutions to these linear optimization problems. This program accepts a text file specifying the optimization problem to solve, and enumerates all terms in the minimization (maximization) set that make up the closed-form solution for the maximum (minimum) of the specified objective function.

B.2.1  Input  Text  File


The  input  text  file  is  composed  of  several  sections  (the  order  being  irrelevant), 
each  delimited  by  a  header.  Blank  lines  may  be  used  to  separate  lines  in  the  text 
file.  In  addition,  comments  may  be  entered  on  lines  by  placing  a  single  ‘%’  before 
free-form  text.  If  an  equation  or  expression  is  too  long  to  fit  on  a  single  line, 
you  may  break  the  line  by  placing  a  backslash  (‘\’)  at  the  end  of  the  line,  and 
continuing  the  expression  on  the  following  line.  This  may  be  repeated  to  extend 
an  expression  over  several  lines.  Symbol  (variable  or  parameter)  names  must 
begin  with  an  alphabetic  character  followed  by  alphabetic,  numeric,  or  underscore 
characters (e.g., a_01, Long_Name_2) and must not exceed 20 characters in
length.  The  following  paragraphs  explain  the  format  of  each  section  in  the  text 
file. 

VARIABLES This section lists the variables that correspond to the vector x in Eq. (B.1). When bounding counterfactual probabilities, the values of these variables specify the distribution of response-functions, e.g., the Q space parameters in Chapter 5. Only one variable may be listed per line, and the VARIABLES keyword must appear on a line by itself.

PARAMETERS This section lists the parameters that correspond to the vector b in Eq. (B.2). When bounding counterfactual probabilities, these correspond to the conditional probability distributions over the observable variables, e.g., the P space parameters in Chapter 5. Only one parameter may be listed per line, and the PARAMETERS keyword must appear on a line by itself.

CONSTRAINTS This section lists the constraints imposed on the variables by the parameters. These constraints may be written as =, ≥, or ≤ relations. The constraints must be linear, i.e., built from plus, minus, and real coefficients. There is no requirement as to the placement of variables versus parameters or additive constants. Suppose that the set of variables is given by {X, Y, Z}, and the set of parameters is given by {A, B, C}; then the following constraints satisfy the required format:

    -A + 8.9X - 4.1Y ≤ C + 4.67B
    A + B = 0.5A
    A + Y ≥ 1 + Z

Nonnegativity constraints are assumed for all variables, e.g.,

    X ≥ 0
    Y ≥ 0

Only one constraint may be listed per line (although it may extend over several lines using a backslash at the end of each incomplete line), and the CONSTRAINTS keyword must appear on a line by itself.

MINIMIZE/MAXIMIZE This keyword indicates whether the objective function is to be minimized or maximized, respectively. This keyword must appear on a line by itself.

OBJECTIVE The objective to be optimized must be an expression that is a linear function of the parameters, variables, and real constants. For example,

    2A + X + 0.5Y - 6.7

END The specification of the optimization problem must be terminated by the END keyword alone on a separate line.


An example of a complete input file used for obtaining the results in Chapter 5 follows:

% This linear optimization specification will be used
% to generate the lower bounds on the average treatment
% effect for experimental studies where subject compliance
% is not perfect.

VARIABLES
q00
q01
q02
q03
q10
q11
q12
q13
q20
q21
q22
q23
q30
q31
q32
q33

PARAMETERS
p00_0
p01_0
p10_0
p11_0
p00_1
p01_1
p10_1
p11_1

CONSTRAINTS
1 = q00 + q01 + q02 + q03 + \
    q10 + q11 + q12 + q13 + \
    q20 + q21 + q22 + q23 + \
    q30 + q31 + q32 + q33
p00_0 = q00 + q01 + q10 + q11
p01_0 = q20 + q22 + q30 + q32
p10_0 = q02 + q03 + q12 + q13
p11_0 = q21 + q23 + q31 + q33
p00_1 = q00 + q01 + q20 + q21
p01_1 = q10 + q12 + q30 + q32
p10_1 = q02 + q03 + q22 + q23
p11_1 = q11 + q13 + q31 + q33

MINIMIZE

OBJECTIVE
q01 + q11 + q21 + q31 - q02 - q12 - q22 - q32

END
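
The lexical rules of this file format (sections delimited by keywords, '%' comments, backslash continuation, symbol-name restrictions) can be sketched in a few lines of Python. This is not the program implemented at the Cognitive Systems Laboratory; it is a minimal, hypothetical reader (logical_lines and parse_spec are invented names) that only splits a specification into its sections and leaves the expressions as unparsed strings. The toy specification in EXAMPLE is likewise made up for illustration.

```python
import re

SECTION_KEYWORDS = {"VARIABLES", "PARAMETERS", "CONSTRAINTS", "OBJECTIVE"}
# A symbol starts with a letter, continues with letters, digits or underscores,
# and is at most 20 characters long.
SYMBOL = re.compile(r"^[A-Za-z][A-Za-z0-9_]{0,19}$")


def logical_lines(text):
    """Yield lines with comments removed, blank lines skipped, continuations joined."""
    pending = ""
    for raw in text.splitlines():
        line = raw.split("%", 1)[0].strip()
        if not line:
            continue
        if line.endswith("\\"):
            pending += line[:-1].strip() + " "
            continue
        yield pending + line
        pending = ""


def parse_spec(text):
    """Split a specification into its sections; expressions are kept as strings."""
    spec = {"VARIABLES": [], "PARAMETERS": [], "CONSTRAINTS": [],
            "OBJECTIVE": [], "sense": None}
    section = None
    for line in logical_lines(text):
        if line == "END":
            break
        if line in ("MINIMIZE", "MAXIMIZE"):
            spec["sense"] = line
        elif line in SECTION_KEYWORDS:
            section = line
        elif section is None:
            raise ValueError(f"text outside any section: {line!r}")
        else:
            if section in ("VARIABLES", "PARAMETERS") and not SYMBOL.match(line):
                raise ValueError(f"invalid symbol name: {line!r}")
            spec[section].append(line)
    return spec


# Toy specification (hypothetical, not taken from the report).
EXAMPLE = """\
% a comment line
VARIABLES
x
y
PARAMETERS
b
CONSTRAINTS
x + \\
  y = b
MINIMIZE
OBJECTIVE
2x - y
END
"""

spec = parse_spec(EXAMPLE)
print(spec)
```

Because the section order is irrelevant, the reader simply switches its current section whenever a keyword line appears and stops at END.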

B.2.2  Program  Output 

The  output  of  the  program  will  first  redisplay  the  problem  specification  in  a 
canonical  form.  It  will  then  display  a  set  of  expressions  that  are  to  be  minimized 
or  maximized  depending  on  whether  the  objective  function  was  to  be  maximized 
or  minimized,  respectively.  These  expressions  will  be  linear  functions  of  the 
specification  parameters  (not  the  variables)  and  a  real  constant. 

For example, the output to the specification file shown above will be:

Constraints:
q00 + q01 + q02 + q03 + q10 + q11 + q12 + q13 + q20 + \
q21 + q22 + q23 + q30 + q31 + q32 + q33 - 1 = 0
p00_0 - q00 - q01 - q10 - q11 = 0
p01_0 - q20 - q22 - q30 - q32 = 0
p10_0 - q02 - q03 - q12 - q13 = 0
p11_0 - q21 - q23 - q31 - q33 = 0
p00_1 - q00 - q01 - q20 - q21 = 0
p01_1 - q10 - q12 - q30 - q32 = 0
p10_1 - q02 - q03 - q22 - q23 = 0
p11_1 - q11 - q13 - q31 - q33 = 0

Minimize Objective:
q01 + q11 + q21 + q31 - q02 - q12 - q22 - q32

Solution:
MAX
{
p11_1 + p00_0 - 1,
p11_0 + p00_1 - 1,
- p01_1 - p10_1,
- p01_0 - p10_0,
p11_0 - p11_1 - p10_1 - p01_0 - p10_0,
p11_1 - p11_0 - p10_0 - p01_1 - p10_1,
p00_1 - p01_1 - p10_1 - p01_0 - p00_0,
p00_0 - p01_0 - p10_0 - p01_1 - p00_1